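We aren't counting how many Values lie near each Value; we want to see whether any other value matches "this" value, and then count those occurrences per group. Working on sorted values removes a lot of the math, since the closest values to any row are its immediate neighbors, and lets us determine this much faster.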
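The sorting shortcut described above can be sketched on a toy vector. This is only an illustration in base R: the vector `x` and the helper names (`brute`, `near_prev`, `near_next`) are made up for the example and are not part of the original data or code. It compares an all-pairs check against a check of just the sorted neighbors, using the same "difference less than 5" rule.

```r
# Hypothetical toy values, not the data from the question
x <- c(20, 1, 40, 3, 22, 10)

# Brute force: for each value, is any *other* value less than 5 away?
brute <- sapply(seq_along(x), function(i) any(abs(x[i] - x[-i]) < 5))

# Sorted shortcut: after sorting, only the immediate neighbors can be
# the closest values, so one pass over the gaps is enough
s <- sort(x)
gap <- diff(s)                  # distance from each value to the next one
near_prev <- c(FALSE, gap < 5)  # close to the previous value
near_next <- c(gap < 5, FALSE)  # close to the next value
neighbor <- near_prev | near_next

sum(brute)     # 4
sum(neighbor)  # 4
```

Both counts agree, but the brute-force version does on the order of n^2 comparisons while the sorted version does a sort plus a single pass.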
Original data:
library(dplyr)
data %>%
  arrange(Value) %>%
  group_by(Group) %>%
  summarize(
    Count = sum((Value - lag(Value)) < 5 | (lead(Value) - Value) < 5, na.rm = TRUE)
  )
# # A tibble: 2 × 2
# Group Count
# <chr> <int>
# 1 A 3
# 2 B 3
A more "Big Data" example, borrowed from ThomasIsCoding:
set.seed(1701)
data <- tibble(
  Group = sample(LETTERS, size = 200000, replace = TRUE),
  Value = sample(1:100, size = 200000, replace = TRUE)
)
data
# # A tibble: 200,000 × 2
# Group Value
# <chr> <int>
# 1 H 91
# 2 O 27
# 3 W 70
# 4 G 33
# 5 D 42
# 6 F 70
# 7 X 66
# 8 X 37
# 9 X 68
# 10 D 45
# # ℹ 199,990 more rows
# # ℹ Use `print(n = ...)` to see more rows
Re-running the dplyr approach:
data %>%
  arrange(Value) %>%
  group_by(Group) %>%
  summarize(Count = sum((Value - lag(Value)) < 5 | (lead(Value) - Value) < 5, na.rm = TRUE))
# # A tibble: 26 × 2
# Group Count
# <chr> <int>
# 1 A 7819
# 2 B 7541
# 3 C 7783
# 4 D 7574
# 5 E 7662
# 6 F 7850
# 7 G 7727
# 8 H 7710
# 9 I 7515
# 10 J 7707
# # ℹ 16 more rows
# # ℹ Use `print(n = ...)` to see more rows
I think data.table might be faster here:
library(data.table)
setDT(data)
setorder(data, Group, Value) # just Value would be fine too
# lag() and lead() here are the dplyr functions, so dplyr must still be loaded
data[, sum(Value - lag(Value) < 5 | lead(Value) - Value < 5, na.rm = TRUE), by = "Group"]
# Group V1
# <char> <int>
# 1: A 7819
# 2: B 7541
# 3: C 7783
# 4: D 7574
# 5: E 7662
# 6: F 7850
# 7: G 7727
# 8: H 7710
# 9: I 7515
# 10: J 7707
# ---
# 17: Q 7560
# 18: R 7614
# 19: S 7771
# 20: T 7833
# 21: U 7700
# 22: V 7770
# 23: W 7730
# 24: X 7672
# 25: Y 7648
# 26: Z 7808
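The data.table call above still relies on dplyr's lag() and lead(). If dplyr is not loaded, a data.table-only variant could use shift() instead. The following is a minimal sketch on made-up toy data (the table `dt` and the column name `Count` are assumptions for the example), not a replacement for the benchmarked code:

```r
library(data.table)

# Hypothetical toy data, not the data from the question
dt <- data.table(
  Group = c("A", "A", "A", "A", "B", "B", "B"),
  Value = c(1, 3, 10, 12, 5, 20, 40)
)
setorder(dt, Group, Value)

# shift(type = "lag") gives the previous value in the group and
# shift(type = "lead") the next one; both pad with NA at the edges
res <- dt[, .(Count = sum(
  Value - shift(Value, type = "lag") < 5 |
    shift(Value, type = "lead") - Value < 5,
  na.rm = TRUE
)), by = Group]
res
```

In group A every value has a neighbor less than 5 away, so its count is 4; in group B no value does, so its count is 0.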
Both of these run in under 1 second. In fact, their relative performance is close, slightly in favor of data.table:
bench::mark(
  dplyr = data %>%
    arrange(Value) %>%
    group_by(Group) %>%
    summarize(Count = sum((Value - lag(Value)) < 5 | (lead(Value) - Value) < 5, na.rm = TRUE)),
  data.table = data[, sum(Value - lag(Value) < 5 | lead(Value) - Value < 5, na.rm = TRUE), by = "Group"],
  check = FALSE, min_iterations = 100)
# # A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 dplyr 23.2ms 25.8ms 37.4 NA 0.763 98 2 2.62s <NULL> <NULL> <bench_tm [100]> <tibble [100 × 3]>
# 2 data.table 13.2ms 19ms 53.1 NA 0 100 0 1.88s <NULL> <NULL> <bench_tm [100]> <tibble [100 × 3]>