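TL;DR: summarize(across(all_of(vars_to_sum), ~ sum(., na.rm=TRUE))) needs a non-NA list of vars_to_sum, but I am working inside a function and users will sometimes ask for no variables to be summed. The summarize() call contains other across() calls, and I am not sure this can be solved with a simple if().
At least I think that is the problem...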
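Hi!
I am working on a project that involves quite a lot of data management/engineering, and one step is giving me a hard time. I will reproduce the problem on fake data.
I am working with hospital stays. I start from a relational database and gradually aggregate the information into a format that suits my needs. At the point where my problem begins, the data look like this: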
df <- data.frame(
  patient_id = c("Anne","Bryan","Bryan","Charlotte","Charlotte","Denis","Denis","Denis"),
  # 录入日期 = entry date; the dates must be wrapped in c() and parsed with as.POSIXct()
  录入日期 = as.POSIXct(c("2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-04-01", "2020-05-01", "2020-05-05", "2020-05-25"), tz = "UTC"),
  # 退出日期 = exit date
  退出日期 = as.POSIXct(c("2020-01-10", "2020-02-02", "2020-02-10", "2020-03-10", "2020-04-10", "2020-05-02", "2020-05-20", "2020-06-10"), tz = "UTC"),
  entry_mode = c("home","home","transfer","home","home","transfer",NA,"transfer"),
  exit_mode = c("death","transfer","home","discharged","home","death",NA,"transfer"),
  drug_A = c(0,0,1,1,0,1,0,0),
  drug_B = c(0,1,1,0,0,0,1,1),
  drug_C = c(1,2,5,1,0,0,1,0)
)
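N.B.: the real database will contain many more drug_xxx variables.
My goal here is to develop a function that will
(1) identify patients who had several hospital stays within a short period (one stay beginning shortly after another ended; N.B.: overlapping stays also exist), because we would consider all such stays to relate to the same medical problem (say, a cardiac arrest). If several hospitals/wards/doctors/... report a cardiac arrest for the same person over consecutive days, that counts as one cardiac arrest followed by several stays/consultations/... all related to that same cardiac arrest.
(2) aggregate the information from nearby stays: take the maximum of some variables (user-defined), sum what should be summed, take the earliest or the latest data point where appropriate, and leave the rest unchanged.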
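To make step (1) concrete, here is a minimal sketch of the gap-based grouping on toy data. The data frame stays, its column names and the value of allowed_dist are made up for illustration, dates are stored as Date so that differences are always in days, and the sketch does not handle a stay that is fully contained in an earlier one.

```r
library(dplyr)

# Hypothetical toy data: one patient, three stays
stays <- data.frame(
  id    = c("Denis", "Denis", "Denis"),
  entry = as.Date(c("2020-05-01", "2020-05-05", "2020-05-25")),
  exit  = as.Date(c("2020-05-02", "2020-05-20", "2020-06-10"))
)

allowed_dist <- 3  # assumed maximum gap, in days, between one exit and the next entry

grouped <- stays %>%
  arrange(id, entry) %>%
  group_by(id) %>%
  mutate(
    # Days elapsed since the previous exit (NA for a patient's first stay)
    gap = as.numeric(entry - lag(exit)),
    # A new episode starts whenever the gap exceeds the allowed distance
    episode = cumsum(!is.na(gap) & gap > allowed_dist)
  ) %>%
  ungroup()

print(grouped)
```

With these values the gaps are NA, 3 and 5 days, so the first two stays share an episode and the third starts a new one.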
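The function takes the following arguments:
- df, a data frame
- individual_id, a string giving the name of the variable that identifies individuals
- date_entry, a string giving the name of the entry-date variable
- date_exit, a string giving the name of the exit-date variable
- allowed_dist, the maximum distance between the exit date of one stay and the entry date of the next
- vars_to_max, the names of the variables to take the maximum of
- vars_to_sum, the names of the variables to sum
- vars_to_earliest, the names of the variables for which the earliest information is of interest
- vars_to_latest, the names of the variables for which the latest information is of interest
Ideally, the function should still work when a vars_to_xxx argument is NA, meaning that no variable is to be aggregated that way.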
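One possible way to express "NA means no variables" is to normalise each vars_to_xxx argument before it is used. The helper below is a sketch only; clean_vars is a hypothetical name and is not part of my function or of any package.

```r
# Hypothetical helper: map NULL or NA entries to an empty character vector
clean_vars <- function(vars) {
  if (is.null(vars)) {
    return(character(0))
  }
  # Drop missing entries and make sure the result is a character vector
  as.character(vars[!is.na(vars)])
}

print(clean_vars(NA))                 # no variables requested
print(clean_vars(c("drug_A", NA)))    # only the non-missing name is kept
```

The idea is that an empty character vector still describes a valid, if empty, selection, whereas NA does not.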
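[Edit: I initially shared my own function, which mixed dplyr and for-loops, but I think it only made the post harder to read, so here is a function that almost works but needs fixing]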
aggregate_stays <- function(df, individual_id, date_entry, date_exit, allowed_dist = 0,
                            vars_to_earliest = NA, vars_to_latest = NA,
                            vars_to_sum = NA, vars_to_max = NA) {
  require("lubridate", quietly = TRUE)
  require("dplyr", quietly = TRUE)

  # Distinguishing between single stays and multiple stays
  df_single_stays <- df %>%
    group_by(!!sym(individual_id)) %>%
    filter(n() == 1) %>%
    ungroup()
  df_multiple_stays <- df %>%
    group_by(!!sym(individual_id)) %>%
    filter(n() > 1) %>%
    ungroup()

  # Summarizing multiple stays according to the arguments of the function
  df_multiple_stays <- df_multiple_stays %>%
    arrange(!!sym(individual_id), !!sym(date_entry)) %>% # Ordering the individuals and stays
    group_by(!!sym(individual_id)) %>%
    # Checking if nearby stays. The gap is forced to days so the comparison does not
    # depend on difftime's automatic units (assumption: allowed_dist is in days)
    mutate(wholestay_id = cumsum(
      as.numeric(difftime(!!sym(date_entry),
                          lag(!!sym(date_exit), default = first(!!sym(date_exit))),
                          units = "days")) > allowed_dist
    )) %>%
    group_by(!!sym(individual_id), wholestay_id) %>%
    mutate(wholestay_id = cur_group_id()) %>% # Creating the index of "big stays"
    ungroup() %>%
    group_by(wholestay_id) %>%
    summarize(wholestay_entry = min(!!sym(date_entry)),
              wholestay_exit = max(!!sym(date_exit)),
              across(all_of(vars_to_earliest), ~ .[which.min(!!sym(date_entry))]),
              across(all_of(vars_to_latest), ~ .[which.max(!!sym(date_exit))]),
              across(all_of(vars_to_sum), ~ sum(., na.rm = TRUE)),
              across(all_of(vars_to_max), ~ max(., na.rm = TRUE))) %>%
    ungroup()

  # Removing the `wholestay_id` variable that was not asked by user
  df_multiple_stays <- df_multiple_stays %>%
    dplyr::select(-wholestay_id)

  df_final <- bind_rows(df_multiple_stays, df_single_stays)
  return(df_final)
}
But it does not seem to like NA being passed as a vars_to_xxx argument:
df2 <- aggregate_stays(df = df,
                       individual_id = "patient_id",
                       date_entry = "录入日期",
                       date_exit = "退出日期",
                       allowed_dist = 10,
                       vars_to_sum = "drug_C",
                       vars_to_max = c("drug_A","drug_B"),
                       vars_to_earliest = "entry_mode",
                       vars_to_latest = c("patient_id","exit_mode"))
works fine,
but
df3 <- aggregate_stays(df = df,
                       individual_id = "patient_id",
                       date_entry = "录入日期",
                       date_exit = "退出日期",
                       allowed_dist = 10,
                       vars_to_sum = NA,
                       vars_to_max = c("drug_A","drug_B","drug_C"),
                       vars_to_earliest = "entry_mode",
                       vars_to_latest = c("patient_id","exit_mode"))
returns
Error in `summarize()`:
ℹ In argument: `across(all_of(vars_to_sum), ~sum(., na.rm = TRUE))`.
Caused by error in `across()`:
! Selections can't have missing values.
Run `rlang::last_trace()` to see where the error occurred.
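For what it is worth, the error can be reproduced outside the function. The sketch below uses a made-up data frame toy and compares an NA selection with an empty character vector; as far as I can tell, across() accepts an empty selection and simply adds no columns, but I have only checked this on toy data.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# An NA selection fails; the error is caught here so the script keeps running
res_na <- tryCatch(
  toy %>%
    group_by(g) %>%
    summarize(across(all_of(NA), ~ sum(., na.rm = TRUE))),
  error = function(e) "error"
)

# An empty character vector selects nothing, and the other summaries still run
res_empty <- toy %>%
  group_by(g) %>%
  summarize(n = n(), across(all_of(character(0)), ~ sum(., na.rm = TRUE)))

print(res_na)
print(res_empty)
```

So the failure seems tied to the NA itself rather than to the fact that nothing is selected.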
If you have any idea how to solve this, I would love to hear it!
Best,
An epidemiologist who wishes they were better at coding.