How to efficiently use Rprof in R?

Posted on April 5

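I would like to know whether it is possible to get a profile from R code similar to the one produced by MATLAB's Profiler, that is, to find out which line numbers are especially slow.

What I have achieved so far is not satisfactory. I used Rprof to create a profile file. Using summaryRprof I get something like the following: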

$by.self
                  self.time self.pct total.time total.pct
[.data.frame               0.72     10.1       1.84      25.8
inherits                   0.50      7.0       1.10      15.4
data.frame                 0.48      6.7       4.86      68.3
unique.default             0.44      6.2       0.48       6.7
deparse                    0.36      5.1       1.18      16.6
rbind                      0.30      4.2       2.22      31.2
match                      0.28      3.9       1.38      19.4
[<-.factor                 0.28      3.9       0.56       7.9
levels                     0.26      3.7       0.34       4.8
NextMethod                 0.22      3.1       0.82      11.5
...

and

$by.total
                      total.time total.pct self.time self.pct
data.frame                  4.86      68.3      0.48      6.7
rbind                       2.22      31.2      0.30      4.2
do.call                     2.22      31.2      0.00      0.0
[                           1.98      27.8      0.16      2.2
[.data.frame                1.84      25.8      0.72     10.1
match                       1.38      19.4      0.28      3.9
%in%                        1.26      17.7      0.14      2.0
is.factor                   1.20      16.9      0.10      1.4
deparse                     1.18      16.6      0.36      5.1
...

To be honest, I cannot tell from this output where my bottlenecks are, because (a) I use data.frame pretty often and (b) I never call functions such as deparse myself. Furthermore, what is [?
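For reference, the `[` entry in these tables is R's subsetting operator, which is an ordinary function. The short sketch below uses a made-up data frame (`df` is hypothetical, not the data from the question) to show that `x[i, j]` and a direct call to `[` give the same result, and that data frames have their own method, `[.data.frame`, which is the name that appears in the profile. Calls to deparse probably come from inside data.frame(), which builds column names from unnamed arguments, so they can show up even if the script never calls deparse directly.

```r
# `[` is an ordinary function; x[i, j] is shorthand for `[`(x, i, j).
df <- data.frame(a = 1:3, b = c("u", "v", "w"))  # made-up example data

via_operator <- df[2, "a"]
via_function <- `[`(df, 2, "a")
identical(via_operator, via_function)

# For a data frame the call dispatches to the S3 method `[.data.frame`,
# which is why that name shows up in the summaryRprof tables.
is.function(getS3method("[", "data.frame"))
```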

So I tried Hadley Wickham's profr, but judging by the following graph it was not any more useful:

Is there a more convenient way to see which line numbers and particular function calls are slow?
Or, is there some literature that I should consult?

Any hints would be appreciated.

EDIT 1:
Based on Hadley's comment I will paste the code of my script below, along with the base graph version of the plot. But note that my question is not related to this specific script. It is just a random script that I recently wrote. I am looking for a general way to find bottlenecks and speed up R code.

The data (x) looks like this:

type      word    response    N   Classification  classN
Abstract  ANGER   bitter      1   3a              3a
Abstract  ANGER   control     1   1a              1a
Abstract  ANGER   father      1   3a              3a
Abstract  ANGER   flushed     1   3a              3a
Abstract  ANGER   fury        1   1c              1c
Abstract  ANGER   hat         1   3a              3a
Abstract  ANGER   help        1   3a              3a
Abstract  ANGER   mad         13  3a              3a
Abstract  ANGER   management  2   1a              1a
... until row 1700

The script (with short explanations) is this:

Rprof("profile1.out")

# A new dataset is produced with each line of x contained x$N times 
y <- vector('list',length(x[,1]))
for (i in 1:length(x[,1])) {
  y[[i]] <- data.frame(rep(x[i,1],x[i,"N"]),rep(x[i,2],x[i,"N"]),rep(x[i,3],x[i,"N"]),rep(x[i,4],x[i,"N"]),rep(x[i,5],x[i,"N"]),rep(x[i,6],x[i,"N"]))
}
all <- do.call('rbind',y)
colnames(all) <- colnames(x)

# create a dataframe out of a word x class table
table_all <- table(all$word,all$classN)
dataf.all <- as.data.frame(table_all[,1:length(table_all[1,])])
dataf.all$words <- as.factor(rownames(dataf.all))
dataf.all$type <- "no"
# get type of the word.
words <- levels(dataf.all$words)
for (i in 1:length(words)) {
  dataf.all$type[i] <- as.character(all[pmatch(words[i],all$word),"type"])
}
dataf.all$type <- as.factor(dataf.all$type)
dataf.all$typeN <- as.numeric(dataf.all$type)

# aggregate response categories
dataf.all$c1 <- apply(dataf.all[,c("1a","1b","1c","1d","1e","1f")],1,sum)
dataf.all$c2 <- apply(dataf.all[,c("2a","2b","2c")],1,sum)
dataf.all$c3 <- apply(dataf.all[,c("3a","3b")],1,sum)

Rprof(NULL)

library(profr)
ggplot.profr(parse_rprof("profile1.out"))
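
As an aside on this particular script: much of the time reported under data.frame and rbind probably comes from the first loop, which builds one small data frame per row and then binds them all together. A minimal sketch of a loop-free alternative is shown below, assuming the goal is simply to repeat each row N times. The three-row `x` here is a made-up stand-in for the real data, and the loop is a simplified version of the one in the script.

```r
# Made-up stand-in for the data described above (only a few columns).
x <- data.frame(
  word     = c("ANGER", "ANGER", "ANGER"),
  response = c("bitter", "mad", "management"),
  N        = c(1, 13, 2),
  stringsAsFactors = FALSE
)

# Loop version in the spirit of the script: one small data frame per row,
# combined with rbind at the end.
y <- vector("list", nrow(x))
for (i in seq_len(nrow(x))) {
  y[[i]] <- x[rep(i, x$N[i]), ]
}
all_loop <- do.call("rbind", y)

# Vectorized version: repeat each row index N times and subset once.
all_vec <- x[rep(seq_len(nrow(x)), times = x$N), ]

# Row names differ between the two results, so reset them before comparing.
rownames(all_loop) <- NULL
rownames(all_vec)  <- NULL
isTRUE(all.equal(all_loop, all_vec))
```

The single subsetting call avoids creating and binding one data frame per row, which is where repeated calls to data.frame, `[.data.frame` and rbind would otherwise accumulate.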

The final data looks like this:

1a    1b  1c  1d  1e  1f  2a  2b  2c  3a  3b  pa  words   type    typeN   c1  c2  c3  pa
3 0   8   0   0   0   0   0   0   24  0   0   ANGER   Abstract    1   11  0   24  0
6 0   4   0   1   0   0   11  0   13  0   0   ANXIETY Abstract    1   11  11  13  0
2 11  1   0   0   0   0   4   0   17  0   0   ATTITUDE    Abstract    1   14  4   17  0
9 18  0   0   0   0   0   0   0   0   8   0   BARREL  Concrete    2   27  0   8   0
0 1   18  0   0   0   0   4   0   12  0   0   BELIEF  Abstract    1   19  4   12  0

The base graph version:

Running the script today also changed the ggplot2 graph a little (basically only the labels), see here.

# create big data frame:
n <- 1000
x <- data.frame(group = sample(letters[1:4], n, replace=TRUE), condition = sample(LETTERS[1:10], n, replace = TRUE), data = rnorm(n), stringsAsFactors = TRUE)  # factors are needed for levels() below (default changed in R 4.0.0)

# reasonable operations:
marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean)

# unreasonable operations:
marginal.means.2 <- marginal.means.1[NULL,]

row.counter <- 1
for (condition in levels(x$condition)) {
  for (group in levels(x$group)) {
    tmp.value <- 0
    tmp.length <- 0
    for (c in 1:nrow(x)) {
      if ((x[c,"group"] == group) & (x[c,"condition"] == condition)) {
        tmp.value <- tmp.value + x[c,"data"]
        tmp.length <- tmp.length + 1
      }
    }
    marginal.means.2[row.counter,"group"] <- group
    marginal.means.2[row.counter,"condition"] <- condition
    marginal.means.2[row.counter,"data"] <- tmp.value / tmp.length
    row.counter <- row.counter + 1
  }
}

# does it produce the same results?
all.equal(marginal.means.1, marginal.means.2)

$by.self
                           self.time self.pct total.time total.pct
download.php?i=KjdkSVZq#17      8.04    64.11       8.04     64.11
<no location>                   4.38    34.93       4.38     34.93
download.php?i=KjdkSVZq#16      0.06     0.48       0.06      0.48
download.php?i=KjdkSVZq#18      0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#23      0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#6       0.02     0.16       0.02      0.16

$by.total
                           total.time total.pct self.time self.pct
download.php?i=KjdkSVZq#17       8.04     64.11      8.04    64.11
<no location>                    4.38     34.93      4.38    34.93
download.php?i=KjdkSVZq#16       0.06      0.48      0.06     0.48
download.php?i=KjdkSVZq#18       0.02      0.16      0.02     0.16
download.php?i=KjdkSVZq#23       0.02      0.16      0.02     0.16
download.php?i=KjdkSVZq#6        0.02      0.16      0.02     0.16

$by.line
                           self.time self.pct total.time total.pct
<no location>                   4.38    34.93       4.38     34.93
download.php?i=KjdkSVZq#6       0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#16      0.06     0.48       0.06      0.48
download.php?i=KjdkSVZq#17      8.04    64.11       8.04     64.11
download.php?i=KjdkSVZq#18      0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#23      0.02     0.16       0.02      0.16

$sample.interval
[1] 0.02

$sampling.time
[1] 12.54
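
Entries of the form file#line, as in the tables above, are what summaryRprof reports when line profiling is switched on. The sketch below shows one way such output might be produced, assuming R 3.0.0 or later and an R build with profiling support. The temporary script and the names `script`, `prof_file`, `d` and `total` are made up for illustration. The code has to be run from a file with source references kept, otherwise samples are reported as <no location>.

```r
# Write a small made-up script to a temporary file so that R can attach
# source references (file name and line number) to each expression.
script <- tempfile(fileext = ".R")
writeLines(c(
  "d <- data.frame(v = rnorm(20000))",
  "total <- 0",
  "for (i in seq_len(nrow(d))) {",
  "  total <- total + d[i, 'v']",   # deliberately slow row-by-row access
  "}",
  "fast_total <- sum(d$v)"
), script)

prof_file <- tempfile(fileext = ".out")

# line.profiling = TRUE asks Rprof to record file and line information.
Rprof(prof_file, line.profiling = TRUE)
source(script, keep.source = TRUE)  # keep.source is needed for line numbers
Rprof(NULL)

# lines = "show" reports time per "file#line" instead of per function.
# summaryRprof() stops with an error if no samples were recorded, which can
# happen when the profiled code finishes too quickly, hence the tryCatch.
prof <- tryCatch(summaryRprof(prof_file, lines = "show"),
                 error = function(e) NULL)
if (!is.null(prof)) print(prof$by.self)
```

In a run like this, the line inside the loop would be expected to dominate, much as line 17 (the if condition inside the innermost loop) does in the tables above. Using lines = "both" instead keeps the function names alongside the line numbers.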
