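I am working with a large data.table and need to find row numbers by group. Unfortunately, sorting the table is not an option, because it is referenced by row position from several other places (by id, time, etc.), so I don't think I can use setkey.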
What is the most efficient way to approach this problem?
So far I have tried which(...), k[..., which = TRUE], and k[, .I[...]]. Is there a faster way?
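For reference, the three lookups mentioned above can be compared side by side on a toy table. This is only a minimal sketch: the six-row table and the value of `x` are made up for illustration, and it checks that the three forms return the same row numbers, not how fast they are.

```r
library(data.table)

# Hypothetical toy table; `a` plays the role of the grouping column.
k <- data.table(a = factor(c("x", "y", "x", "z", "y", "x")))
x <- "x"  # the group whose row numbers we want

r1 <- which(k$a == x)          # base R scan of the column vector
r2 <- k[a == x, which = TRUE]  # data.table subset that returns row numbers
r3 <- k[, .I[a == x]]          # .I is the vector of row numbers 1..nrow(k)

# All three should agree on the matching rows.
stopifnot(identical(r1, r2), identical(r1, r3))
print(r1)
```

None of these calls reorders `k`, so row positions referenced elsewhere stay valid.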
Benchmarking suggests that for smaller tables (fewer than …00 rows; full code below), which(...) is more efficient than k[..., which = TRUE]:
                     test replications elapsed relative user.self sys.self
1 k[a == x, which = TRUE]           10    2.63    1.789      2.52     0.10
2         which(k$a == x)           10    1.47    1.000      1.47     0.00
3          setindex(k, a)           10    2.71    1.844      2.64     0.06
4         k[, .I[a == x]]           10    2.03    1.381      2.00     0.00
But as the number of rows grows, k[..., which = TRUE] becomes much faster:
> library(data.table)
> rbenchmark::benchmark(
+   # A: data.table subset with which = TRUE, no explicit index
+   "A" = {
+     k <- data.table(
+       a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+     )
+     u <- unique(k$a)
+     m <- lapply(u, function(x) k[a == x, which = TRUE])
+   },
+   # B: base R which() on the column vector
+   "B" = {
+     k <- data.table(
+       a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+     )
+     u <- unique(k$a)
+     m <- lapply(u, function(x) which(k$a == x))
+   },
+   # C: same as A, but with an explicit secondary index on a
+   "C" = {
+     k <- data.table(
+       a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+     )
+     u <- unique(k$a)
+     setindex(k, a)
+     m <- lapply(u, function(x) k[a == x, which = TRUE])
+   },
+   # D: .I filtered in j, with an explicit secondary index on a
+   "D" = {
+     k <- data.table(
+       a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
+     )
+     u <- unique(k$a)
+     setindex(k, a)
+     m <- lapply(u, function(x) k[, .I[a == x]])
+   },
+   replications = 10,
+   columns = c("test", "replications", "elapsed",
+               "relative", "user.self", "sys.self"))
  test replications elapsed relative user.self sys.self
1    A           10    3.64    1.000      3.61     0.08
2    B           10   43.22   11.874     42.73     0.02
3    C           10    3.70    1.016      3.72     0.04
4    D           10   46.71   12.832     46.33     0.03
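One direction that might be worth timing alongside A through D is to collect the row numbers for every group in a single pass instead of looping over `u`. The sketch below is an assumption on my part, not something benchmarked above: it uses `by` (which, unlike `keyby` or setkey, leaves the rows of `k` where they are) and, as a base R alternative, `split()` on the row numbers. The seed is arbitrary.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
k <- data.table(
  a = sample(factor(seq_len(200)), size = 1000000, replace = TRUE)
)

# Option 1: one grouped pass; each group's row numbers (.I) become one
# element of a list column. `by` does not reorder k itself.
g <- k[, .(rows = list(.I)), by = a]

# Option 2: base R, split the row numbers by the factor levels.
s <- split(seq_len(nrow(k)), k$a)

# Spot check one group against the per-group lookup used in the benchmark.
x <- k$a[1]
expected <- which(k$a == x)
stopifnot(identical(g$rows[[which(g$a == x)]], expected))
stopifnot(identical(s[[as.character(x)]], expected))
```

Whether either option beats the lapply over k[a == x, which = TRUE] would need to be measured on the real data; the point is only that both return row numbers in the table's existing order without sorting it.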