A faster complete() on a grouped df using the tidyverse

Posted on April 22

Yesterday I posted a question about how to expand/complete within group values.

The solution worked well when tested on a minimal example df, but on my real data it still had not finished computing after several hours.

My dplyr pipeline looks like this:

mydf <- mydf |>
  group_by_at(vars(id:trial_day)) |>
  summarise_at(vars(bla:last_col()), sum) |>
  complete(trial_day = 1:14)

I tried swapping complete() for expand(), but the result keeps only the grouping variable and the expanded variable; the other variables are dropped.

df <- data.frame(
  id = rep('a', 5),
  x = 6:10,
  y = 5:1
)

# returns all cols
df |> 
  group_by(id) |> 
  complete(x = 1:10)

# only returns id and x, no y
df |> 
  group_by(id) |> 
  expand(x = 1:10)

But I'm not sure whether expand() would be any faster.

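I tried a right join onto a df ladder = data.frame(day = 1:14), but the expanded rows then have NA for the grouping variables, and I want those filled in when the expansion happens.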

Is there a faster way than my use of complete() to get the same result?

Recommended answer

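If you have a fixed set of values to expand across groups, as in the example you shared, you don't need group_by. We want to complete the same fixed values (1:10) regardless of which group they fall in, so the grouping column can be passed to complete() directly.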

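If you remove group_by from the code, it runs roughly 3x faster than on grouped data (a median of about 108 ms versus 311 ms in the benchmark below). I created a larger example to demonstrate this.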

# New example data
df <- data.frame(
  id = sample(c(letters, LETTERS), 1e6, replace = TRUE),
  x = sample(9, 1e6, replace = TRUE),
  y = sample(5, 1e6, replace = TRUE)
)

# compare the results with and without group_by and make sure they are the same

library(tidyr)
library(dplyr)

res1 <- df |> group_by(id) |> complete(x = 1:10) |> ungroup() |> arrange(id)
res2 <- df |> complete(id, x = 1:10) |> arrange(id)

identical(res1, res2)
#[1] TRUE

# Check the performance
microbenchmark::microbenchmark(
  with_group = df |> group_by(id) |> complete(x = 1:10), 
  without_group = df |> complete(id, x = 1:10)
)

# Unit: milliseconds
#          expr       min        lq     mean   median       uq      max neval cld
#    with_group 257.97722 290.68443 354.1348 311.0921 344.7817 673.3979   100  a 
# without_group  76.98431  98.45134 144.6897 107.9547 116.0458 539.7276   100   b
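
The pipeline in the question groups by several columns (id:trial_day), not just id. A minimal sketch of how the same idea might carry over, assuming the grouping columns are known by name: tidyr's nesting() restricts the expansion to combinations that actually occur in the data, so no group_by is needed before complete(). The column site and the toy values below are hypothetical stand-ins, not taken from the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: `site` stands in for any grouping column that sits
# between `id` and `trial_day`; `bla` is the value column being summed.
mydf <- data.frame(
  id = c("a", "a", "b", "b"),
  site = c("s1", "s1", "s2", "s2"),
  trial_day = c(1L, 3L, 2L, 2L),
  bla = c(10, 20, 30, 40)
)

# Sum within id/site/trial_day; the result is still grouped by id and site
summed <- mydf |>
  group_by(id, site, trial_day) |>
  summarise(across(bla, sum), .groups = "drop_last")

# Grouped version, equivalent to the pipeline in the question
grouped <- summed |>
  complete(trial_day = 1:14) |>
  ungroup()

# Ungrouped version: nesting() keeps only the id/site pairs present in the data
ungrouped <- summed |>
  ungroup() |>
  complete(nesting(id, site), trial_day = 1:14)

# Compare the two results
all.equal(grouped, ungrouped)
```

Whether the speedup on the real data matches the benchmark above depends on the number of groups and columns, so it is worth timing both versions on a subset first.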