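TL;DR: summarize(across(all_of(vars_to_sum), ~ sum(., na.rm=TRUE))) needs vars_to_sum to be a vector with no NA in it, but I am working inside a function and users will sometimes ask for no variable to be summed. The summarize() call contains other across() calls, and I am not sure this can be solved with a simple if().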

At least I think that is the problem...


Hi!

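I am working on a project that involves quite a lot of data management/engineering, and one step is giving me a hard time. I will reproduce the problem on fake data.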

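I am working with hospital stays. I start from a relational database and gradually aggregate the information into a format that suits my needs. At the point where my problem starts, the data look like this: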

df <- data.frame(
  patient_id = c("Anne","Bryan","Bryan","Charlotte","Charlotte","Denis","Denis","Denis"),
  # as.POSIXct() takes a single vector, so the dates are wrapped in c()
  entry_date = as.POSIXct(c("2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-04-01", "2020-05-01", "2020-05-05", "2020-05-25")),
  exit_date = as.POSIXct(c("2020-01-10", "2020-02-02", "2020-02-10", "2020-03-10", "2020-04-10", "2020-05-02", "2020-05-20", "2020-06-10")),
  entry_mode = c("home","home","transfer","home","home","transfer",NA,"transfer"),
  exit_mode = c("death","transfer","home","discharged","home","death",NA,"transfer"),
  drug_A = c(0,0,1,1,0,1,0,0),
  drug_B = c(0,1,1,0,0,0,1,1),
  drug_C = c(1,2,5,1,0,0,1,0)
)

N.B.: the real database will contain many more drug_xxx variables.

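My goal here is to develop a function that:

(1) identifies patients who had several stays within a short period (a stay starting shortly after another one ended; N.B.: overlapping stays also exist), because we would consider all such stays to be related to the same medical problem (say, a cardiac arrest). If several hospitals/wards/doctors/... report a cardiac arrest for the same person on consecutive days, this is treated as one cardiac arrest followed by several stays/consultations/..., all related to that same cardiac arrest.

(2) aggregates the information from nearby stays: taking the maximum of some variables (defined by the user), summing what should be summed, taking the earliest or latest data point where appropriate, and leaving the rest unchanged.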

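The function takes the following arguments:

  • df, a data frame
  • individual_id, a string giving the name of the variable that identifies individuals
  • date_entry
  • date_exit
  • allowed_dist, the maximum distance between the exit date of one stay and the entry date of another
  • vars_to_max, the names of the variables to maximize
  • vars_to_sum, the names of the variables to sum
  • vars_to_earliest, the names of the variables for which the earliest information is of interest
  • vars_to_latest, the names of the variables for which the latest information is of interest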

Ideally, the function should work even when a vars_to_xxx argument is NA, meaning that no variable is to be aggregated that way.

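[EDIT: I initially shared my own function, which mixed dplyr and for-loops, but I think it only made the post harder to read, so here is a function that almost works but needs fixing]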

aggregate_stays <- function (df, individual_id, date_entry, date_exit, allowed_dist=0, vars_to_earliest=NA, vars_to_latest=NA, vars_to_sum=NA, vars_to_max=NA){
  
  require("lubridate",quietly=TRUE)
  require("dplyr",quietly=TRUE)
  
  # Distinguishing between single stays and multiple stays
  df_single_stays <- df %>%
    group_by(!!sym(individual_id)) %>%
    filter(n() == 1) %>%
    ungroup()
  df_multiple_stays <- df %>%
    group_by(!!sym(individual_id)) %>%
    filter(n() > 1) %>%
    ungroup()
  
  # Summarizing multiple stays according to the arguments of the function
  df_multiple_stays <- df_multiple_stays %>%
    arrange(!!sym(individual_id), !!sym(date_entry)) %>% # Ordering the individuals and stays
    group_by(!!sym(individual_id)) %>%
    mutate(wholestay_id = cumsum(!!sym(date_entry) - lag(!!sym(date_exit), default = first(!!sym(date_exit))) > allowed_dist)) %>% # Checking if nearby stays
    group_by(!!sym(individual_id), wholestay_id) %>%
    mutate(wholestay_id = cur_group_id()) %>% # Creating the index of "big stays"
    ungroup() %>%
    group_by(wholestay_id) %>%
    summarize(wholestay_entry = min(!!sym(date_entry)),
              wholestay_exit = max(!!sym(date_exit)),
              across(all_of(vars_to_earliest), ~ .[which.min(!!sym(date_entry))]),
              across(all_of(vars_to_latest), ~ .[which.max(!!sym(date_exit))]),
              across(all_of(vars_to_sum), ~ sum(., na.rm=TRUE)),
              across(all_of(vars_to_max), ~ max(., na.rm=TRUE))) %>%
    ungroup()
  
  
  # Removing the `wholestay_id` variable that was not asked by user
  df_multiple_stays <- df_multiple_stays %>%
    dplyr::select(-wholestay_id)
  
  df_final <- bind_rows(df_multiple_stays,df_single_stays)
  
  return(df_final)
}

But it does not seem to like NA as the value of a vars_to_xxx argument.

df2 <- aggregate_stays(df = df,
                       individual_id = "patient_id",
                       date_entry = "entry_date",
                       date_exit = "exit_date",
                       allowed_dist = 10,
                       vars_to_sum = "drug_C",
                       vars_to_max = c("drug_A","drug_B"),
                       vars_to_earliest = "entry_mode",
                       vars_to_latest = c("patient_id","exit_mode"))

works fine, whereas

df3 <- aggregate_stays(df = df,
                       individual_id = "patient_id",
                       date_entry = "entry_date",
                       date_exit = "exit_date",
                       allowed_dist = 10,
                       vars_to_sum = NA,
                       vars_to_max = c("drug_A","drug_B","drug_C"),
                       vars_to_earliest = "entry_mode",
                       vars_to_latest = c("patient_id","exit_mode"))

returns

Error in `summarize()`:
ℹ In argument: `across(all_of(vars_to_sum), ~sum(., na.rm = TRUE))`.
Caused by error in `across()`:
! Selections can't have missing values.
Run `rlang::last_trace()` to see where the error occurred.

If you have any idea how to solve this, I would love to hear it!

Best,
An epidemiologist who wishes they were better at coding.

Recommended answer

Pass NULL instead of NA. across() then selects no columns, so that line of the summarize() call is simply skipped.
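A minimal sketch of that point, keeping the character-vector interface from the question. The function name `sum_and_max()` and the `toy` data are made up for illustration; only `across()` and `all_of()` come from the original code.

```r
library(dplyr)

# Hypothetical, reduced version of the summarize() step:
# NULL defaults mean "no column selected" for that across() call.
sum_and_max <- function(data, vars_to_sum = NULL, vars_to_max = NULL) {
  data %>%
    summarise(
      across(all_of(vars_to_sum), ~ sum(.x, na.rm = TRUE)),
      across(all_of(vars_to_max), ~ max(.x, na.rm = TRUE))
    )
}

toy <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

# Both kinds of aggregation requested
res_both <- sum_and_max(toy, vars_to_sum = "a", vars_to_max = "b")

# Nothing to sum: vars_to_sum stays NULL and its across() contributes no column
res_max_only <- sum_and_max(toy, vars_to_max = c("a", "b"))
```

With the NULL default, the caller just omits the argument instead of passing NA.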

You can also make things easier by using NULL as the default value and using tidy evaluation, so that the column names do not have to be quoted:

library(tidyverse)

aggregate_things <- function(
      data,
      vars_to_sum = NULL,
      vars_to_mean = NULL,
      vars_to_max = NULL
) {
  
  data |> 
    summarise(
      across({{vars_to_sum}}, sum),
      across({{vars_to_mean}}, mean),
      across({{vars_to_max}}, max),
    )
}

iris |> 
  aggregate_things(vars_to_sum = c(Sepal.Length, Sepal.Width),
                   vars_to_max = c(Petal.Length, Petal.Width))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1        876.5       458.6          6.9         2.5
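If some callers still pass NA, as in the question, one possible workaround is to normalize the argument before it reaches `all_of()`. This is only a sketch and not part of the answer above: `clean_vars()` and `sum_selected()` are hypothetical names, and the approach assumes the arguments are character vectors of column names.

```r
library(dplyr)

# Hypothetical helper: turn NULL or NA into an empty character vector,
# which all_of() accepts as "select nothing".
clean_vars <- function(vars) {
  if (is.null(vars)) return(character(0))
  as.character(vars[!is.na(vars)])
}

# Hypothetical reduced function that keeps NA as the default, as in the question
sum_selected <- function(data, vars_to_sum = NA) {
  vars_to_sum <- clean_vars(vars_to_sum)
  data %>%
    summarise(
      n_rows = n(),
      across(all_of(vars_to_sum), ~ sum(.x, na.rm = TRUE))
    )
}

toy <- data.frame(x = c(1, 2, 3), y = c(10, NA, 30))

res_na <- sum_selected(toy, vars_to_sum = NA)          # only n_rows is returned
res_xy <- sum_selected(toy, vars_to_sum = c("x", "y")) # n_rows plus the two sums
```

Switching the defaults to NULL remains the simpler option when the function signature can be changed.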

Modifying your code to take NULLs

library(tidyverse)

df <- data.frame(
  patient_id = c("Anne","Bryan","Bryan","Charlotte","Charlotte","Denis","Denis","Denis"),
  entry_date = ymd("2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-04-01", "2020-05-01", "2020-05-05", "2020-05-25"),
  exit_date = ymd("2020-01-10", "2020-02-02", "2020-02-10", "2020-03-10", "2020-04-10", "2020-05-02", "2020-05-20", "2020-06-10"),
  entry_mode = c("home","home","transfer","home","home","transfer",NA,"transfer"),
  exit_mode = c("death","transfer","home","discharged","home","death",NA,"transfer"),
  drug_A = c("0","0","1","1","0","1","0","0"),
  drug_B = c("0","1","1","0","0","0","1","1"),
  drug_C = c("1","2","5","1","0","0","1","0")
) |> 
  mutate(across(starts_with("drug"), as.integer))


aggregate_stays <-
  function (df,
            individual_id,
            date_entry,
            date_exit,
            allowed_dist = 0,
            vars_to_earliest = NULL,
            vars_to_latest = NULL,
            vars_to_sum = NULL,
            vars_to_max = NULL) {
    
    require("lubridate",quietly=TRUE)
    require("dplyr",quietly=TRUE)

  # Distinguishing between single stays and multiple stays
  df_single_stays <- df %>%
    group_by({{individual_id}}) %>%
    filter(n() == 1) %>%
    ungroup()
  
  df_multiple_stays <- df %>%
    group_by({{individual_id}}) %>%
    filter(n() > 1) %>%
    ungroup()

  # Summarizing multiple stays according to the arguments of the function
  df_multiple_stays <- df_multiple_stays %>%
    arrange({{individual_id}}, {{date_entry}}) %>% # Ordering the individuals and stays
    group_by({{individual_id}}) %>%
    mutate(wholestay_id = cumsum({{date_entry}} - lag({{date_exit}}, default = first({{date_exit}})) > allowed_dist)) %>% # Checking if nearby stays
    group_by({{individual_id}}, wholestay_id) %>%
    mutate(wholestay_id = cur_group_id()) %>% # Creating the index of "big stays"
    ungroup() %>%
    group_by(wholestay_id) %>%
    summarize(wholestay_entry = min({{date_entry}}),
              wholestay_exit = max({{date_exit}}),
              across({{vars_to_earliest}}, ~ .[which.min({{date_entry}})]),
              across({{vars_to_latest}}, ~ .[which.max({{date_exit}})]),
              across({{vars_to_sum}}, ~ sum(., na.rm=TRUE)),
              across({{vars_to_max}}, ~ max(., na.rm=TRUE))) %>%
    ungroup()
  
  
  # Removing the `wholestay_id` variable that was not asked by user
  df_multiple_stays <- df_multiple_stays %>%
    dplyr::select(-wholestay_id)
  
  df_final <- bind_rows(df_multiple_stays,df_single_stays)
  
  return(df_final)
}

aggregate_stays(df = df,
                    individual_id = patient_id,
                    date_entry = entry_date,
                    date_exit = exit_date,
                    allowed_dist = 10,
                    vars_to_sum = drug_C,
                    vars_to_max = c(drug_A,drug_B),
                    vars_to_earliest = entry_mode,
                    vars_to_latest = c(patient_id,exit_mode))
#> # A tibble: 5 × 10
#>   wholestay_entry wholestay_exit entry_mode patient_id exit_mode  drug_C drug_A
#>   <date>          <date>         <chr>      <chr>      <chr>       <int>  <int>
#> 1 2020-02-01      2020-02-10     home       Bryan      home            7      1
#> 2 2020-03-01      2020-03-10     home       Charlotte  discharged      1      1
#> 3 2020-04-01      2020-04-10     home       Charlotte  home            0      0
#> 4 2020-05-01      2020-06-10     transfer   Denis      transfer        1      1
#> 5 NA              NA             home       Anne       death           1      0
#> # ℹ 3 more variables: drug_B <int>, entry_date <date>, exit_date <date>

aggregate_stays(df = df,
                individual_id = patient_id,
                date_entry = entry_date,
                date_exit = exit_date,
                allowed_dist = 10,
                vars_to_max = c(drug_A,drug_B,drug_C),
                vars_to_earliest = entry_mode,
                vars_to_latest = c(patient_id,exit_mode))
#> # A tibble: 5 × 10
#>   wholestay_entry wholestay_exit entry_mode patient_id exit_mode  drug_A drug_B
#>   <date>          <date>         <chr>      <chr>      <chr>       <int>  <int>
#> 1 2020-02-01      2020-02-10     home       Bryan      home            1      1
#> 2 2020-03-01      2020-03-10     home       Charlotte  discharged      1      0
#> 3 2020-04-01      2020-04-10     home       Charlotte  home            0      0
#> 4 2020-05-01      2020-06-10     transfer   Denis      transfer        1      1
#> 5 NA              NA             home       Anne       death           0      0
#> # ℹ 3 more variables: drug_C <int>, entry_date <date>, exit_date <date>

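This should now take your data and give the expected results. An added benefit is that if you pipe a data frame into the function in RStudio, Tab autocompletion will look up the columns of the data frame for you and pass them as unquoted column names.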
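The grouping step used in the function (a cumsum() over gaps larger than allowed_dist) can also be checked in isolation. The sketch below is a simplified variant, not the exact expression from the function: it assumes Date columns, measures the gap explicitly in days, and uses illustrative names (`stays`, `gap_days`, `new_block`, `block`).

```r
library(dplyr)

# A subset of the example data: two patients with several stays each
stays <- data.frame(
  patient_id = c("Denis", "Denis", "Denis", "Charlotte", "Charlotte"),
  entry_date = as.Date(c("2020-05-01", "2020-05-05", "2020-05-25",
                         "2020-03-01", "2020-04-01")),
  exit_date  = as.Date(c("2020-05-02", "2020-05-20", "2020-06-10",
                         "2020-03-10", "2020-04-10"))
)
allowed_dist <- 10

flagged <- stays %>%
  arrange(patient_id, entry_date) %>%
  group_by(patient_id) %>%
  mutate(
    # Days between this entry and the previous exit (NA for a patient's first stay)
    gap_days  = as.numeric(entry_date - lag(exit_date), units = "days"),
    # A new block starts at the first stay or after a gap longer than allowed_dist
    new_block = is.na(gap_days) | gap_days > allowed_dist,
    # Running count of block starts gives a per-patient block index
    block     = cumsum(new_block)
  ) %>%
  ungroup()
```

Here Charlotte's two stays are 22 days apart and end up in separate blocks, while Denis's three stays (gaps of 3 and 5 days) share one block, which matches the row counts in the output above.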
