
Proper/fastest way to reshape a data.table

Posted on August 2

I have a data table in R:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
      x y  v
 [1,] 1 A 12
 [2,] 1 B 62
 [3,] 1 A 60
 [4,] 1 B 61
 [5,] 2 A 83
 [6,] 2 B 97
 [7,] 2 A  1
 [8,] 2 B 22
 [9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49

I can easily sum the variable v by the groups in the data.table:

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
     x  y SUM
[1,] 1 A  72
[2,] 1 B 123
[3,] 2 A  84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B  96

However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:

out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
     x SUM.A SUM.B
[1,] 1    72   123
[2,] 2    84   119
[3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?

Recommended answer

The data.table package provides its own, faster implementations of melt and dcast (written in C). They also add features that reshape2 lacks, such as melting and casting multiple columns at once. See the vignette "Efficient reshaping using data.tables" on GitHub.
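As a minimal sketch of how this applies to the question, dcast can aggregate and reshape in a single call by passing an aggregation function. The v values below are copied from the table printed in the question rather than regenerated with sample(), since the random stream may differ between R versions; the result name wide is only illustrative.

```r
library(data.table)

# Data from the question, with v typed in from the printed table
DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                 y = rep(c("A", "B"), times = 6),
                 v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49))

# One step: sum v within each (x, y) cell and spread y into columns
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
wide
```

The resulting columns are named A and B rather than SUM.A and SUM.B; if the original names matter, setnames(wide, c("A", "B"), c("SUM.A", "SUM.B")) should restore them.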

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

- There is no need to load the reshape2 package before casting. If you want it loaded for other operations, load it before loading data.table.
- dcast is also an S3 generic. There is no more need for dcast.data.table(); just use dcast().
- melt:
  - can melt columns of type 'list'.
  - gains variable.factor and value.factor, which default to TRUE and FALSE respectively for compatibility with reshape2. These directly control the output type of the variable and value columns (factor or not).
  - melt.data.table's na.rm = TRUE is internally optimised to remove NAs directly during melting and is therefore much more efficient.
  - NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list are combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt.
- dcast:
  - accepts multiple fun.aggregate and multiple value.var. See the vignette or ?dcast.
  - the rowid() function can be used directly in the formula to generate an id column, which is sometimes required to identify rows uniquely. See ?dcast.
- Old benchmarks:
  - melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
  - dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
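The multi-column features listed above can be sketched roughly as follows. The table wide_dt and its columns (id, a_1, a_2, b_1, b_2) are hypothetical and made up purely for illustration; only melt, patterns, and dcast come from data.table.

```r
library(data.table)

# Hypothetical wide table: two families of columns (a_*, b_*) per id
wide_dt <- data.table(id  = 1:2,
                      a_1 = c(10, 20),   a_2 = c(11, 21),
                      b_1 = c("p", "q"), b_2 = c("r", "s"))

# Melt both families at once: each pattern becomes its own value column
long_dt <- melt(wide_dt, id.vars = "id",
                measure.vars = patterns("^a_", "^b_"),
                value.name = c("a", "b"))

# Cast back to wide using several value.var columns in one call
back_dt <- dcast(long_dt, id ~ variable, value.var = c("a", "b"))
back_dt
```

Here the variable column of long_dt holds the position of each column within its pattern group (1, 2), not the original column name, which is why the cast produces one output column per value.var and position.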

Reminder of Cologne (Dec 2013) presentation slide 32 : Why not submit a dcast pull request to reshape2?
