data.table vs dplyr: can one do something well the other can't or does poorly?

Posted on December 31

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples, and so far my conclusions are:

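1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. dplyr has more accessible syntax
3. dplyr abstracts (or will abstract) potential database interactions
4. There are some minor functionality differences (see "Examples/Usage" below)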

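In my mind, 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant to my specific question, which is asked from the perspective of someone already familiar with data.table. I would also like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).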

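Question

What I want to know is:

1. Are there analytical tasks that are a lot easier to code with one package or the other for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package than in the other?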

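One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):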

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))

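This is much better than my first hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun; note that here I favored a single statement over the strictly most optimal solution):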

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

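The syntax for the latter may seem very esoteric, but it is actually pretty straightforward if you're used to data.table (i.e. it doesn't use any of the more esoteric tricks).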
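As a quick check that the two statements above express the same logic, here is a minimal sketch that runs both on a small made-up table. The `toy` data, its values, and the result names `res_dplyr` and `res_dt` are assumptions for illustration only; the question's real data is the `dat` object given in the Data section.

```r
library(data.table)
library(dplyr)

# Hypothetical toy data in the same shape as the question's `dat`
toy <- data.frame(
  id   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  name = c("Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob"),
  year = c(1980L, 1981L, 1982L, 1983L, 1985L, 1986L, 1987L),
  job  = c("Manager", "Manager", "Boss", "Boss", "Manager", "Boss", "Boss"),
  job2 = c(1L, 1L, 0L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# dplyr: keep every non-Boss row plus the first Boss year in each group,
# then take a running sum of job2 within the group
res_dplyr <- toy %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2)) %>%
  ungroup() %>%
  as.data.frame()

# data.table: the same filter and running sum, applied to .SD per (id, job) group
res_dt <- as.data.table(toy)[,
  .SD[job != "Boss" | year == min(year)][, cumu_job2 := cumsum(job2)],
  by = list(id, job)
]
```

Both results keep five of the seven rows, and the `cumu_job2` columns can be compared directly because both approaches preserve the input row order within groups here.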

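Ideally, what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples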

Usage

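dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5. Also, @beginneR shows a potential workaround using do in the answer to @eddi's question).
data.table supports rolling joins (thanks @dholstius) as well as overlap joins.
data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax. See here for more details and a tiny benchmark.
dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note: programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge).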
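To make the data.table side of these usage points concrete, here is a minimal sketch. The tables `DT`, `prices` and `trades` and all their values are hypothetical, chosen only to show a grouped operation that returns two rows per group, a subset in the `DT[col == value]` form, and a rolling join.

```r
library(data.table)

# A grouped operation returning more than one row per group
DT <- data.table(g = c("a", "a", "a", "b", "b", "b"),
                 x = c(1, 2, 3, 10, 20, 30))
q <- DT[, .(q = quantile(x, c(0.25, 0.75))), by = g]

# Subset written in plain base R style; data.table may optimise
# this form internally through automatic indexing
hits <- DT[g == "a"]

# Rolling join: each trade picks up the most recent price at or before its time
prices <- data.table(time = c(1L, 5L, 10L), price = c(100, 105, 110))
trades <- data.table(time = c(3L, 7L, 10L))
rolled <- prices[trades, on = "time", roll = TRUE]
```

With `roll = TRUE` the last observation is carried forward, so the trade at time 3 is matched to the price recorded at time 1, and the trade at time 7 to the price recorded at time 5.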

Benchmarks

I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K) at which point data.table becomes substantially faster.
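@Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark on getting unique values has data.table ~6x faster.
(Unverified) A benchmark has data.table 75% faster on larger versions of a group/apply/sort, while dplyr was 40% faster on the smaller ones (another SO question from the comments, thanks danas).
Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
An older benchmark on 80K groups has data.table ~8x faster.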
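For readers who want to try a comparison of this kind themselves, here is a minimal sketch of a grouped-sum timing with many groups. The sizes `n_rows` and `n_groups` are hypothetical and far smaller than the benchmarks cited above, and `system.time()` on a single run is only a rough indication, so the numbers it prints should not be read as reproducing any of the results listed.

```r
library(data.table)
library(dplyr)

set.seed(1)
n_rows   <- 1e5  # hypothetical size, chosen so the sketch runs quickly
n_groups <- 2e4  # hypothetical number of groups

DF <- data.frame(g = sample.int(n_groups, n_rows, replace = TRUE),
                 v = runif(n_rows))
DT <- as.data.table(DF)

# Time the same grouped sum in both packages
t_dt <- system.time(
  res_dt <- DT[, .(total = sum(v)), keyby = g]
)
t_dplyr <- system.time(
  res_dp <- DF %>% group_by(g) %>% summarise(total = sum(v))
)

# Elapsed seconds for each approach (varies by machine and package version)
print(c(data.table = t_dt[["elapsed"]], dplyr = t_dplyr[["elapsed"]]))
```

Checking that `res_dt` and `res_dp` agree before comparing timings guards against benchmarking two different computations.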

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

## Sketch: assumes a shallow() function is available to the user
## (data.table does not export one)
foo <- function(DT) {
    DT = shallow(DT)          ## shallow copy DT
    DT[, newcol := 1L]        ## does not affect the original DT
    DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
    DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will
                              ## also get modified.
}
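The comments in the function above depend on `:=` modifying a data.table by reference. Here is a minimal sketch of that behaviour using the exported `copy()` function, which makes a deep copy, in place of the `shallow()` function assumed above. The helper names `add_flag_in_place` and `add_flag_on_copy` are hypothetical.

```r
library(data.table)

# Adds a column by reference: the caller's table is changed too
add_flag_in_place <- function(DT) {
  DT[, newcol := 1L]
  invisible(DT)
}

# Works on a deep copy, so the caller's table is left untouched
add_flag_on_copy <- function(DT) {
  DT <- copy(DT)
  DT[, newcol := 1L]
  DT
}

orig <- data.table(x = 1:4)

res <- add_flag_on_copy(orig)
changed_after_copy <- "newcol" %in% names(orig)      # FALSE: orig is unchanged

add_flag_in_place(orig)
changed_after_in_place <- "newcol" %in% names(orig)  # TRUE: orig gained the column
```

A deep copy duplicates every column, which is the cost a shallow copy would avoid for columns that are never modified.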

DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
#    x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
#    x y mul
# 1: 1 a   4
# 2: 2 b   3

# case (a)
DT[, sum(y), by = z]                       ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y))   ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))

# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])

# case (a)
DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum))   ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))

setkey(DT1, x, y)

# 1. normal join
DT1[DT2]            ## data.table syntax
left_join(DT2, DT1) ## dplyr syntax

# 2. select columns while join
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

# 3. aggregate while join
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)

# 4. update while join
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
## dplyr equivalent: ??

# 5. rolling join
DT1[DT2, roll = -Inf]
## dplyr equivalent: ??

# 6. other arguments to control output
DT1[DT2, mult = "first"]
## dplyr equivalent: ??
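The "aggregate while join" case above can be checked end to end on the two small tables. This is a minimal sketch that rebuilds `DT1` and `DT2` with the values shown earlier. The plain data frames `DF1` and `DF2`, the explicit `by = c("x", "y")` argument and the result names `agg_dt` and `agg_dp` are assumptions added so the block runs on its own.

```r
library(data.table)
library(dplyr)

DT1 <- data.table(x = c(1, 1, 1, 1, 2, 2, 2, 2), y = c("a", "a", "b", "b"),
                  z = 1:8, key = c("x", "y"))
DT2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y"))

# data.table: for each row of DT2, sum the matching z values in DT1
# and scale the sum by that row's mul
agg_dt <- DT1[DT2, .(sum(z) * i.mul), by = .EACHI]

# dplyr: aggregate first, then join and scale
DF1 <- as.data.frame(DT1)
DF2 <- as.data.frame(DT2)
agg_dp <- DF1 %>%
  group_by(x, y) %>%
  summarise(z = sum(z)) %>%
  inner_join(DF2, by = c("x", "y")) %>%
  mutate(z = z * mul) %>%
  select(-mul) %>%
  ungroup()
```

The data.table version only aggregates the rows that actually match `DT2`, whereas the dplyr pipeline aggregates every (x, y) group of `DF1` before the join discards the unmatched ones.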

DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1])   ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))

