I am trying to find the last element of each group in a data.table in a time-efficient way. I have a working solution:
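library(data.table)
Data <- data.table(id = c(rep("a", 2), rep("b", 3)),
                   time = c(1:2, 1:3))
# Add the per-group maximum of time as a helper column
Data[, lastobs := max(time), by = id]
# Keep only the rows where time equals the group maximum
Data <- Data[time == lastobs]
# Drop the helper column
Data[, lastobs := NULL]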
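But the max() by-group operation takes quite a long time. On a dataset of 3 million observations, which I can still manage and where time is a year-month variable, it takes 20 seconds, and I need to do this for many year-months. I now want to move to a much larger dataset, where this becomes infeasible. I figure there must be a data.table way to do this: simply sort the data.table by id and time, and then use the .I, .N, or .SD shorthand (which I have never understood) to keep the last element per group, without having to compute something like max() within each group. Is there such a solution? My attempt was: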
Data <- data.table(id = c(rep("a", 2), rep("b",3)),
time = c(1:2, 1:3))
Data[,.N == .I, by = id]
This selects the last row of the first group and the first row of the second group, because I don't really understand this syntax...
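For reference, here is a minimal sketch of the sort-then-keep-last idea described above, using the same toy data. It relies on my understanding that `.N` is the number of rows in the current group while `.I` holds that group's row numbers within the whole table, which would explain why `.N == .I` only matches by coincidence. The variable names (`last_idx`, `last_rows`, `last_rows_unique`, `flag`) are made up for illustration, and I have not benchmarked any of this on large data.

```r
library(data.table)

Data <- data.table(id = c(rep("a", 2), rep("b", 3)),
                   time = c(1:2, 1:3))

# Sort by id and time so the last row of each group has the largest time.
setorder(Data, id, time)

# .I[.N] is the global row number of the last row in each group.
last_idx <- Data[, .I[.N], by = id]$V1
last_rows <- Data[last_idx]

# Alternative on sorted data: keep the last occurrence of each id.
last_rows_unique <- unique(Data, by = "id", fromLast = TRUE)

# A within-group comparison that flags the last row of every group:
# compare the position inside the group (seq_len(.N)) with the group size.
flag <- Data[, seq_len(.N) == .N, by = id]$V1

print(last_rows)
```

Both variants avoid computing `max()` per group, but they only return the right rows if the table is sorted by id and time first.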
Edit: to diagnose what is going wrong with GForce:
> sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8 LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
[5] LC_TIME=German_Germany.utf8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.14.10
loaded via a namespace (and not attached):
[1] compiler_4.3.2 tools_4.3.2
> Data <- data.table(id = c(rep("a", 2), rep("b",3)),
+ time = c(1:2, 1:3))
> Data[verbose = TRUE, , lastobs := max(time), by = id]
Detected that j uses these columns: time
Finding groups using forderv ... forder.c received 5 rows and 1 columns
0.000s elapsed (0.000s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ...
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'max(time)'
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
memcpy contiguous groups took 0.000s for 2 groups
eval(j) took 0.000s for 2 calls
0.000s elapsed (0.000s cpu)
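The `GForce FALSE` line in the verbose output above suggests that `max(time)` is evaluated once per group rather than by data.table's optimized grouping routine. The following is only a sketch, under the assumption (not confirmed by the output above) that in this data.table version GForce is skipped when `j` is a `:=` assignment but can apply to a plain aggregation. It computes the per-group maximum as an aggregation and then joins back to recover the full rows. The names `agg` and `last_rows` are illustrative.

```r
library(data.table)

Data <- data.table(id = c(rep("a", 2), rep("b", 3)),
                   time = c(1:2, 1:3))

# Plain aggregation instead of := ; one row per id with the largest time.
# Add verbose = TRUE to this call to check whether GForce is reported.
agg <- Data[, .(time = max(time)), by = id]

# Join back on id and time to recover the complete rows
# (relevant when Data has more columns than in this toy example).
last_rows <- Data[agg, on = .(id, time)]

print(last_rows)
```

Running the aggregation with `verbose = TRUE` on the real data should show whether the optimization actually kicks in there.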