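Suppose I work in psychology and want to know how many risk factors each patient has. After that, I want to list all of a patient's risks and then find the most prevalent risk (the mode). I was thinking of using mutate and then paste0 to get the colname whenever the row's value is "risk", but I'm having a hard time with it. Any help is appreciated.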


The code is as follows:

library(tidyverse)
df = data.frame(
  patient = 1:60,
  cancer = c("risk","ok"), 
  blood_pres = c("risk", "ok"),
  low_education = c("risk","ok")
) 

df = df %>% mutate(how_many_risks =
                     rowSums(. == "risk"))

Recommended answer

Let's come up with some more interesting data.

set.seed(43)
df <- data.frame(
  patient = 1:10,
  cancer = sample(c("risk","ok"), size=10, replace=TRUE),
  blood_pres = sample(c("risk","ok"), size=10, replace=TRUE),
  low_education = sample(c("risk","ok"), size=10, replace=TRUE)
)
df
#    patient cancer blood_pres low_education
# 1        1     ok       risk          risk
# 2        2     ok       risk          risk
# 3        3     ok         ok            ok
# 4        4   risk       risk          risk
# 5        5     ok         ok          risk
# 6        6   risk       risk            ok
# 7        7     ok         ok            ok
# 8        8     ok       risk            ok
# 9        9     ok         ok            ok
# 10      10   risk       risk          risk

From here, we'll pivot to long format, summarize per patient, and then join the result back onto the original data.

library(dplyr)
library(tidyr) # pivot_*
df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  summarize(howmany = n(), risks = toString(name), .by = patient) %>%
  left_join(df, ., by = "patient") %>%
  mutate(howmany = coalesce(howmany, 0))
#    patient cancer blood_pres low_education howmany                             risks
# 1        1     ok       risk          risk       2         blood_pres, low_education
# 2        2     ok       risk          risk       2         blood_pres, low_education
# 3        3     ok         ok            ok       0                              <NA>
# 4        4   risk       risk          risk       3 cancer, blood_pres, low_education
# 5        5     ok         ok          risk       1                     low_education
# 6        6   risk       risk            ok       2                cancer, blood_pres
# 7        7     ok         ok            ok       0                              <NA>
# 8        8     ok       risk            ok       1                        blood_pres
# 9        9     ok         ok            ok       0                              <NA>
# 10      10   risk       risk          risk       3 cancer, blood_pres, low_education

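(Note that .by= requires dplyr_1.1.0 or newer. If you are on an older dplyr and cannot update, use group_by(patient) before summarize instead of .by=patient.)

One thing you may want to consider: unless this is only for a presentation table, it is sometimes advantageous to keep risks as a list-column instead of a comma-delimited string. To do that, just replace toString with list. It may render the same on the console, but it allows set-like operations on the column (though normal column/vector operations may not work as you expect):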
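As a rough sketch of that group_by(patient) variant: the small data frame below is hand-written for illustration (it is not the sampled data above), and the explicit ungroup() and the integer 0L in coalesce() are precautions I am assuming are useful on older dplyr, where coalesce() was, to my understanding, stricter about mixing integer and double.

```r
library(dplyr)
library(tidyr) # pivot_longer

# Hand-written illustrative data (an assumption, not the sampled df above)
df <- data.frame(
  patient = 1:4,
  cancer = c("ok", "risk", "ok", "risk"),
  blood_pres = c("risk", "risk", "ok", "ok"),
  low_education = c("risk", "risk", "ok", "ok")
)

res <- df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  group_by(patient) %>%                            # used in place of .by = patient
  summarize(howmany = n(), risks = toString(name)) %>%
  ungroup() %>%                                    # harmless if already ungrouped
  left_join(df, ., by = "patient") %>%
  mutate(howmany = coalesce(howmany, 0L))          # 0L matches the integer from n()
res
```

Patients with no risks drop out at the filter step, so the left_join back onto df is what restores them, with howmany filled in as 0 and risks left as NA.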

out <- df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  summarize(howmany = n(), risks = list(name), .by = patient) %>%
  left_join(df, ., by = "patient") %>%
  mutate(howmany = coalesce(howmany, 0))
out
#    patient cancer blood_pres low_education howmany                             risks
# 1        1     ok       risk          risk       2         blood_pres, low_education
# 2        2     ok       risk          risk       2         blood_pres, low_education
# 3        3     ok         ok            ok       0                              NULL
# 4        4   risk       risk          risk       3 cancer, blood_pres, low_education
# 5        5     ok         ok          risk       1                     low_education
# 6        6   risk       risk            ok       2                cancer, blood_pres
# 7        7     ok         ok            ok       0                              NULL
# 8        8     ok       risk            ok       1                        blood_pres
# 9        9     ok         ok            ok       0                              NULL
# 10      10   risk       risk          risk       3 cancer, blood_pres, low_education

If this data were a tibble (tbl_df), the same data would display as

tibble(out)
# # A tibble: 10 × 6
#    patient cancer blood_pres low_education howmany risks    
#      <int> <chr>  <chr>      <chr>           <dbl> <list>   
#  1       1 ok     risk       risk                2 <chr [2]>
#  2       2 ok     risk       risk                2 <chr [2]>
#  3       3 ok     ok         ok                  0 <NULL>   
#  4       4 risk   risk       risk                3 <chr [3]>
#  5       5 ok     ok         risk                1 <chr [1]>
#  6       6 risk   risk       ok                  2 <chr [2]>
#  7       7 ok     ok         ok                  0 <NULL>   
#  8       8 ok     risk       ok                  1 <chr [1]>
#  9       9 ok     ok         ok                  0 <NULL>   
# 10      10 risk   risk       risk                3 <chr [3]>

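With a list-column we can directly do things like get the length of each row's entry in that column, or quickly test for exact set membership: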

lengths(out$risks)
#  [1] 2 2 0 3 1 2 0 1 0 3

sapply(out$risks, `%in%`, x = "cancer")
#  [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
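The question also asks for the most prevalent risk (the mode). One possible way to get it from a list-column is to flatten it and tabulate. This is a minimal base-R sketch; the `risks` list below is a hypothetical stand-in shaped like out$risks, not the actual data above.

```r
# Hypothetical list-column shaped like out$risks (NULL = patient with no risks)
risks <- list(
  c("blood_pres", "low_education"),
  c("cancer", "blood_pres", "low_education"),
  NULL,
  "blood_pres"
)

# unlist() drops the NULL entries; table() counts each risk name
counts <- table(unlist(risks))
counts

# Keep every name that reaches the maximum count, so ties are not lost
most_common <- names(counts)[counts == max(counts)]
most_common
```

With the real data, the same two lines should apply to out$risks directly. Comparing against max(counts) rather than using which.max() is a deliberate choice here, since which.max() returns only the first of several tied risks.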

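Granted, both of these could be done with regex on the comma-delimited string, but regex carries some overhead, and it becomes fragile if the names are ambiguous in any way (for example, when one risk name is a substring of another).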
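To illustrate the ambiguity concern, here is a small sketch with hypothetical risk names, where "cancer" is a substring of "cancer_family" (neither the names nor the data come from the example above):

```r
# Hypothetical risks, stored both as a list-column and as delimited strings
risks_list <- list(c("cancer_family", "blood_pres"), "cancer", NULL)
risks_str  <- sapply(risks_list, toString)

# Naive regex also matches "cancer_family", so row 1 is a false positive
by_regex <- grepl("cancer", risks_str)
by_regex

# Exact set membership only matches the literal name "cancer"
by_set <- sapply(risks_list, `%in%`, x = "cancer")
by_set
```

A more careful pattern with delimiters or anchors could avoid the false positive, but the list-column version needs no such care.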
