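R: Find the last element per group in a data.table without max()

Posted on January 23

I am trying to find the last element of each group in a data.table in a time-efficient way. I have a working solution: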

library(data.table)
Data <- data.table(id = c(rep("a", 2), rep("b",3)), 
                   time = c(1:2, 1:3))
Data[, lastobs := max(time), by = id]
Data <- Data[time == lastobs]
Data[, lastobs := NULL]

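But the max() by group command takes quite a long time. On a dataset of 3 million observations, which I can still manage and where time is a year-month variable, it takes 20 seconds, and I need to do this for many months across many years. I now want to move to a much larger dataset, where this becomes infeasible. I figure there must be a data.table way to do this: simply sort the data.table by id and time, and then use the .I, .N, or .SD shorthands, which I have never understood, to keep the last element per group without having to compute something like max() within each group. Is there such a solution? My attempt was: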

Data <- data.table(id = c(rep("a", 2), rep("b",3)), 
               time = c(1:2, 1:3))
Data[,.N == .I, by = id]

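This selects the last row of the first group and the first row of the second group, because I don't really understand this syntax...

Edit: to check what is going wrong with GForce: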
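For reference, a minimal sketch of why the attempt behaves that way and what the row-index idiom could look like. `.I` holds row numbers within the whole table, while `.N` is the size of the current group, so `.N == .I` only matches by coincidence. `.I[.N]` instead picks the table-wide row number of each group's last row. The names `last_idx` and `last_rows` are illustrative, and the sketch assumes the table is sorted by time within each id.

```r
library(data.table)

Data <- data.table(id = c(rep("a", 2), rep("b", 3)),
                   time = c(1:2, 1:3))

# Sort by id and time so the last row of each group is the latest one
setorder(Data, id, time)

# .I[.N]: table-wide row number of the last row in each group
last_idx <- Data[, .I[.N], by = id]$V1

# Subset the original table by those row numbers
last_rows <- Data[last_idx]
print(last_rows)
```

Collecting the row numbers first and subsetting once avoids materialising `.SD` for every group, which tends to matter when there are many small groups.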

> sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8    LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.10

loaded via a namespace (and not attached):
[1] compiler_4.3.2 tools_4.3.2   

> Data <- data.table(id = c(rep("a", 2), rep("b",3)), 
+                    time = c(1:2, 1:3))
> Data[verbose = TRUE, , lastobs := max(time), by = id]
Detected that j uses these columns: time 
Finding groups using forderv ... forder.c received 5 rows and 1 columns
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 
0.000s elapsed (0.000s cpu) 
lapply optimization is on, j unchanged as 'max(time)'
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... 
memcpy contiguous groups took 0.000s for 2 groups
eval(j) took 0.000s for 2 calls
0.000s elapsed (0.000s cpu)
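The verbose output above reports `GForce FALSE` for the `:=` form. One possible workaround, sketched here under the assumption that the plain aggregating form `.(lastobs = max(time))` is eligible for GForce in the installed version (the verbose output of the first call should confirm or refute this), is to aggregate first and then join back. The names `agg` and `last_rows` are illustrative.

```r
library(data.table)

Data <- data.table(id = c(rep("a", 2), rep("b", 3)),
                   time = c(1:2, 1:3))

# Aggregate without := ; verbose = TRUE reports whether GForce was used
agg <- Data[, .(lastobs = max(time)), by = id, verbose = TRUE]

# Join back to keep only the rows where time equals the group maximum
last_rows <- Data[agg, on = .(id, time = lastobs)]
print(last_rows)
```

Note that if several rows in a group share the maximum time, this join returns all of them, just like the original `time == lastobs` filter.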

library(data.table)
Data <- data.table(id = sample(1e5, 3e6, 1), time = runif(3e6))
microbenchmark::microbenchmark(
  max = dt[, .SD[time == max(time)], by = id],
  order = dt[, .SD[order(-time)][1], by = id],
  SD = setorder(dt, time)[, .SD[.N,], by = id],
  duplicated = setorder(dt, -time)[duplicated(id) == FALSE],
  which.max = Data[Data[,.I[which.max(time)], id][[2]]],
  setup = {dt <- copy(Data)},
  unit = "relative",
  times = 10
)
#> Unit: relative
#>        expr         min        lq       mean     median        uq        max neval
#>         max  45.5618283  44.17181  42.815332  41.820841  44.65020 37.1529480    10
#>       order 120.9743203 117.88609 112.860512 111.569424 114.80476 99.5545706    10
#>          SD  42.6016642  40.95944  38.931049  38.914373  39.36850 33.3685212    10
#>  duplicated   1.0000000   1.00000   1.000000   1.000000   1.00000  1.0000000    10
#>   which.max   0.8926029   1.02968   1.012396   1.025658   1.07502  0.9811257    10

microbenchmark::microbenchmark(
  original = dt[,lastobs := max(time), id][time == lastobs][,lastobs := NULL],
  unique = unique(setorder(dt, -time), by = "id"),
  duplicated = setorder(dt, -time)[duplicated(id) == FALSE],
  which.max = Data[Data[,.I[which.max(time)], id][[2]]],
  setup = {dt <- copy(Data)},
  unit = "relative"
)
#> Unit: relative
#>        expr      min       lq     mean   median       uq      max neval
#>    original 1.068006 1.060215 1.113420 1.079614 1.143387 1.112775   100
#>      unique 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
#>  duplicated 1.193816 1.219129 1.269957 1.241923 1.307255 1.163229   100
#>   which.max 1.036372 1.017441 1.062959 1.024766 1.090846 1.030185   100
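Before relying on the timings, it may be worth checking that the fast variants return the same rows. The sketch below does this on the small example table from the question. The names `res_unique` and `res_which` are illustrative, and the comparison assumes there are no tied maximum times within a group.

```r
library(data.table)

Data <- data.table(id = c(rep("a", 2), rep("b", 3)),
                   time = c(1:2, 1:3))

# unique() keeps the first row per id, so sort descending by time first
dt <- copy(Data)
res_unique <- unique(setorder(dt, -time), by = "id")

# which.max() gives the position of the maximum within each group;
# .I turns that into a table-wide row number
res_which <- Data[Data[, .I[which.max(time)], by = id]$V1]

# Put both results in the same row order before comparing
setorder(res_unique, id)
setorder(res_which, id)
print(all.equal(res_unique, res_which))
```

With ties, `unique()`, `duplicated()` and `which.max()` each keep a single row per group, whereas filtering on `time == max(time)` keeps every tied row, so the variants are only interchangeable when the maximum is unique.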
