I have a simulation with a huge aggregate-and-combine step right in the middle. I prototyped the process using plyr's ddply() function, which works well for a large share of my needs. But I need this aggregation step to be faster, since I have to run 10K simulations. I'm already scaling the simulations out in parallel, but if this one step were faster I could greatly reduce the number of nodes I need.
Here's a reasonable simplification of what I'm trying to do:
library(plyr)   # for ddply()
library(Hmisc)  # for wtd.mean()
# Set up some example data
year <- sample(1970:2008, 1e6, rep=T)
state <- sample(1:50, 1e6, rep=T)
group1 <- sample(1:6, 1e6, rep=T)
group2 <- sample(1:3, 1e6, rep=T)
myFact <- rnorm(1e6, 15, 1e6)  # length matched to the other columns
weights <- rnorm(1e6)
myDF <- data.frame(year, state, group1, group2, myFact, weights)
# this is the step I want to make faster
system.time(aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
  function(df) wtd.mean(df$myFact, weights = df$weights)
))
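For reference, here is a minimal sketch of the same grouped weighted mean done with data.table instead of ddply (this assumes the data.table package is installed, and uses base R's weighted.mean() rather than Hmisc::wtd.mean(); it also uses non-negative uniform weights and a smaller n so the sketch runs quickly). Because data.table evaluates the j expression once per group in optimized C code, this pattern is usually much faster than ddply when there are many small groups.

```r
library(data.table)

# Same shape of data as above (smaller n, non-negative weights -- both
# assumptions made just for this sketch)
n <- 1e5
DT <- data.table(year    = sample(1970:2008, n, replace = TRUE),
                 state   = sample(1:50, n, replace = TRUE),
                 group1  = sample(1:6, n, replace = TRUE),
                 group2  = sample(1:3, n, replace = TRUE),
                 myFact  = rnorm(n, 15, 1),
                 weights = runif(n))

# Grouped weighted mean: j is evaluated within each (year, state, group1,
# group2) group, yielding one row per observed combination
aggregateDT <- DT[, list(wmean = weighted.mean(myFact, weights)),
                  by = list(year, state, group1, group2)]
```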
Thanks for any tips or suggestions!