Let's work with some more interesting data.
set.seed(43)
df <- data.frame(patient = 1:10, cancer = sample(c("risk","ok"), size=10, replace=TRUE), blood_pres = sample(c("risk","ok"), size=10, replace=TRUE), low_education = sample(c("risk","ok"), size=10, replace=TRUE))
df
# patient cancer blood_pres low_education
# 1 1 ok risk risk
# 2 2 ok risk risk
# 3 3 ok ok ok
# 4 4 risk risk risk
# 5 5 ok ok risk
# 6 6 risk risk ok
# 7 7 ok ok ok
# 8 8 ok risk ok
# 9 9 ok ok ok
# 10 10 risk risk risk
From here, we pivot to long format, summarize per patient, and then join the result back onto the original data.
library(dplyr)
library(tidyr) # pivot_*
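df %>%
  # long format: one row per patient/column pair; column names land in "name"
  pivot_longer(cols = -patient, values_to = "risk") %>%
  # keep only the at-risk entries
  filter(risk == "risk") %>%
  # count the risks per patient and collapse their names into one string
  summarize(howmany = n(), risks = toString(name), .by = patient) %>%
  # join back so patients with no risks are kept (their howmany is NA, then 0)
  left_join(df, ., by = "patient") %>%
  mutate(howmany = coalesce(howmany, 0))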
# patient cancer blood_pres low_education howmany risks
# 1 1 ok risk risk 2 blood_pres, low_education
# 2 2 ok risk risk 2 blood_pres, low_education
# 3 3 ok ok ok 0 <NA>
# 4 4 risk risk risk 3 cancer, blood_pres, low_education
# 5 5 ok ok risk 1 low_education
# 6 6 risk risk ok 2 cancer, blood_pres
# 7 7 ok ok ok 0 <NA>
# 8 8 ok risk ok 1 blood_pres
# 9 9 ok ok ok 0 <NA>
# 10 10 risk risk risk 3 cancer, blood_pres, low_education
(Note that using .by= requires dplyr_1.1.0 or later. If you are on an older dplyr and cannot update, add group_by(patient) before summarize instead of passing .by=patient to it.)
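For older dplyr, the group_by(patient) variant might look like the sketch below. The three-patient data frame is made up for illustration and is not the data above. One assumption worth flagging: older versions of coalesce were stricter about types, so this sketch uses 0L to match the integer returned by n() rather than the 0 used above.

```r
library(dplyr)
library(tidyr)

# Small illustrative data, not the data from the answer
df <- data.frame(
  patient    = 1:3,
  cancer     = c("ok", "risk", "ok"),
  blood_pres = c("risk", "risk", "ok")
)

res <- df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  group_by(patient) %>%                        # replaces .by = patient
  summarize(howmany = n(), risks = toString(name)) %>%
  left_join(df, ., by = "patient") %>%
  mutate(howmany = coalesce(howmany, 0L))      # 0L keeps the column integer

res
```

Because there is only one grouping variable, summarize returns an ungrouped result here, so no explicit ungroup() is needed before the join.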
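One thing you might want to consider: unless this is solely for a presentation table, it is sometimes advantageous to keep risks as a list-column rather than a comma-separated string. To do that, simply replace toString with list. It may look much the same when printed on the console, but it allows set-like operations on the column (though normal column/vector operations may not work as you expect):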
out <- df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  summarize(howmany = n(), risks = list(name), .by = patient) %>%
  left_join(df, ., by = "patient") %>%
  mutate(howmany = coalesce(howmany, 0))
out
# patient cancer blood_pres low_education howmany risks
# 1 1 ok risk risk 2 blood_pres, low_education
# 2 2 ok risk risk 2 blood_pres, low_education
# 3 3 ok ok ok 0 NULL
# 4 4 risk risk risk 3 cancer, blood_pres, low_education
# 5 5 ok ok risk 1 low_education
# 6 6 risk risk ok 2 cancer, blood_pres
# 7 7 ok ok ok 0 NULL
# 8 8 ok risk ok 1 blood_pres
# 9 9 ok ok ok 0 NULL
# 10 10 risk risk risk 3 cancer, blood_pres, low_education
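Keeping the list-column does not rule out a string version for display: it can be derived when needed. A minimal base-R sketch, using a hand-built list (illustrative values, standing in for out$risks):

```r
# Hand-built stand-in for a list-column such as out$risks
risks <- list(c("blood_pres", "low_education"), NULL, "cancer")

# Collapse each element to a single comma-separated string;
# vapply guarantees exactly one character value per element
risks_chr <- vapply(risks, toString, character(1))
risks_chr
```

Note that a NULL entry collapses to an empty string here, not to NA as in the toString pipeline above, so the two representations are not identical for patients without risks.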
If this data were a tibble (tbl_df), the same data would be displayed as:
tibble(out)
# # A tibble: 10 × 6
# patient cancer blood_pres low_education howmany risks
# <int> <chr> <chr> <chr> <dbl> <list>
# 1 1 ok risk risk 2 <chr [2]>
# 2 2 ok risk risk 2 <chr [2]>
# 3 3 ok ok ok 0 <NULL>
# 4 4 risk risk risk 3 <chr [3]>
# 5 5 ok ok risk 1 <chr [1]>
# 6 6 risk risk ok 2 <chr [2]>
# 7 7 ok ok ok 0 <NULL>
# 8 8 ok risk ok 1 <chr [1]>
# 9 9 ok ok ok 0 <NULL>
# 10 10 risk risk risk 3 <chr [3]>
We can directly do things like find the length of each row's entry in that column, or quickly test for exact set membership:
lengths(out$risks)
# [1] 2 2 0 3 1 2 0 1 0 3
sapply(out$risks, `%in%`, x = "cancer")
# [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
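Admittedly, both of these could be done with regular expressions on the string version, but regex carries some overhead if there is any ambiguity in the names.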
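To make the ambiguity concrete, here is a small base-R sketch. The name skin_cancer is hypothetical (it is not a column in the data above) and is chosen only because it contains another name as a substring.

```r
# Hand-built list-column; "skin_cancer" is a hypothetical, ambiguous name
risks <- list(c("cancer", "blood_pres"), "skin_cancer", NULL)
risks_chr <- vapply(risks, toString, character(1))

# Exact membership on the list-column: no pattern needed
exact <- vapply(risks, function(r) "cancer" %in% r, logical(1))

# Naive regex on the string version also matches "skin_cancer"
loose <- grepl("cancer", risks_chr)

# The regex has to be anchored on the separators to agree with `exact`
strict <- grepl("(^|, )cancer(,|$)", risks_chr)

# A set-like question: which rows have both of these risks?
both <- vapply(risks, function(r) all(c("cancer", "blood_pres") %in% r), logical(1))
```

In this sketch exact and strict agree while loose flags the second row as well, which is the kind of extra care the string representation asks for.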