如下面的示例所示,我有一个数据表(让我们称其为"Current"),它有多个行,对应3个不同的人,每个人都有一个日期和一个分类事件.我们可以假设我已经按时间顺序对它进行了排序,如下所示,但这可能无关紧要.

我想为每一行的每个分类事件类型提供单独的列,这些列将每个人在该时间点之前遇到该事件的次数相加(让我们将此输出数据表称为"Desired").

我通过循环编写了一些代码,但考虑到我的数据大小(数百万行),它显然不太顺利……有什么合适的数据操作方法可以应用于数据表"Current",以将其转换为数据表"Desired"?

编辑:我应该提到,我的原始数据集包括需要带入最终结果的列.我很抱歉最初没有包括这一点.

library(data.table)

current <- data.table(
  person.id = c(1,2,3,1,2,3),
  event = factor(c("categoryA", "categoryA", "categoryD", 
                   "categoryA", "categoryC", "categoryB")),
  date = as.Date(c("2020-01-01", "2020-03-23", "2020-09-09", 
                   "2020-12-30", "2021-06-03", "2022-03-22"))
)

desired <- current |>
  someManipulation(...)

print(current)
print(desired)

(输出)

   person.id     event       date
1:         1 categoryA 2020-01-01
2:         2 categoryA 2020-03-23
3:         3 categoryD 2020-09-09
4:         1 categoryA 2020-12-30
5:         2 categoryC 2021-06-03
6:         3 categoryB 2022-03-22

   person.id     event       date categoryA categoryB categoryC categoryD
1:         1 categoryA 2020-01-01         1         0         0         0
2:         2 categoryA 2020-03-23         1         0         0         0
3:         3 categoryD 2020-09-09         0         0         0         1
4:         1 categoryA 2020-12-30         2         0         0         0
5:         2 categoryC 2021-06-03         1         0         1         0
6:         3 categoryB 2022-03-22         0         1         0         1

推荐答案

可以基于值event创建列,如果每行中的列名等于event值,则为其分配1.然后,您可以使用cumsum乘以person.id来累加值.

library(data.table)

categories <- sort(as.character(unique(current$event)))

current[, (categories) := lapply(
                            lapply(categories, function(x) +(event == x)), 
                                 cumsum), .(person.id)][]

#>    person.id     event       date categoryA categoryB categoryC categoryD
#> 1:         1 categoryA 2020-01-01         1         0         0         0
#> 2:         2 categoryA 2020-03-23         1         0         0         0
#> 3:         3 categoryD 2020-09-09         0         0         0         1
#> 4:         1 categoryA 2020-12-30         2         0         0         0
#> 5:         2 categoryC 2021-06-03         1         0         1         0
#> 6:         3 categoryB 2022-03-22         0         1         0         1

如果您更喜欢dcast,并且希望保留顺序并避免日期冲突,请try 在dcast之前添加行ID号.

result  <- dcast(current[, rid := .I], 
                    rid + person.id + date + event ~ event, 
                    fun = length)[, rid := NULL][]

cols <- setdiff(names(result), names(current))

result[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols][]

#>    person.id       date     event categoryA categoryB categoryC categoryD
#> 1:         1 2020-01-01 categoryA         1         0         0         0
#> 2:         2 2020-03-23 categoryA         1         0         0         0
#> 3:         3 2020-09-09 categoryD         0         0         0         1
#> 4:         1 2020-12-30 categoryA         2         0         0         0
#> 5:         2 2021-06-03 categoryC         1         0         1         0
#> 6:         3 2022-03-22 categoryB         0         1         0         1

不过,我觉得很难相信dcast更快:

library(microbenchmark)

microbenchmark(
  M_categories = {
    categories <- sort(as.character(unique(current$event)))
    current[, (categories) := lapply(
                                lapply(categories, function(x) event == x),
                                cumsum), .(person.id)]
  },
  
  M_dcast = {
    result  <- dcast(current[, rid := .I], 
                        rid + person.id + date + event ~ event, 
                        fun = length)[, rid := NULL][]
    cols <- setdiff(names(result), names(current))
    result[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols][]
  },
  
  GT_dcats = {
    result = dcast(current, formula = person.id + date ~ event, fun.aggregate = length)
    cols = names(result)[names(result) %like% "category"]
    result[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols]
  }
  )

Unit: microseconds
         expr    min      lq     mean  median      uq     max neval
 M_categories  563.4  600.45  655.900  626.15  651.95  1302.0   100
      M_dcast 3963.3 4235.40 4627.580 4443.90 4620.25 15476.1   100
     GT_dcats 3355.5 3554.40 4030.594 3711.10 3929.75 14002.1   100

对于较大的数据集,dcast的速度明显较慢:

large_data <- current[sample(nrow(current), 10000, replace = TRUE), ]

microbenchmark(
  M_categories = {
    categories <- sort(as.character(unique(large_data$event)))
    large_data[, (categories) := lapply(
                                lapply(categories, function(x) event == x),
                                cumsum), .(person.id)]
  },

  M_dcast = {
    result  <- dcast(large_data[, rid := .I], 
                        rid + person.id + date + event ~ event, 
                        fun = length)[, rid := NULL][]
    cols <- setdiff(names(result), names(large_data))
    result[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols][]
  },

  GT_dcats = {
    result = dcast(large_data, formula = person.id + date ~ event, fun.aggregate = length)
    cols = names(result)[names(result) %like% "category"]
    result[, (cols) := lapply(.SD, cumsum), by = person.id, .SDcols = cols]
  }
  )

Unit: milliseconds
         expr     min       lq     mean   median       uq     max neval
 M_categories  1.4199  1.65105  2.23003  1.98575  2.45670 10.9188   100
      M_dcast 26.1525 27.53905 28.46625 28.07925 28.64415 36.4706   100
     GT_dcats  4.3829  4.94760  5.49732  5.44890  5.78995 15.0705   100

创建于2024-01-29年第reprex v2.0.2

R相关问答推荐

R中的枢轴/转置

使用scale_x_continuous复制ggplot 2中的离散x轴

过滤矩阵以获得R中的唯一组合

是否可以 Select 安装不带文档的R包以更有效地存储?

从R导出全局环境中的所有sf(numrames)对象

提取具有连续零值的行,如果它们前面有R中的有效值

par函数中的缩写,比如mgp,mar,mai是如何被破译的?

在使用bslb和bootstrap5时,有没有办法更改特定dt行的 colored颜色 ?

线性模型斜率在减少原始数据时提供NA

将多个列值转换为二进制

如何计算R glm probit中的线性预测因子?

为什么在写入CSV文件时Purrr::Pwalk不起作用

R中治疗序列的相对时间指数

按两个因素将观测值分组后计算单独的百分比

如何从嵌套数据中自动创建命名对象?在R中

为什么将负值向量提升到分数次方会得到NaN

使用ggplot2绘制具有边缘分布的坡度图

如何获取R chromote中的当前URL?

根据排名的顶点属性调整曲线图布局(&Q)

使用dqur在不同变量上创建具有多个条件的变量