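I would like to know whether it is possible to get a profile of R code that is similar to MATLAB's Profiler, that is, to find out which line numbers are particularly slow.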

What I have achieved so far is not satisfactory. I used Rprof to create a profile file. Using summaryRprof I get something like the following:

$by.self
                  self.time self.pct total.time total.pct
[.data.frame               0.72     10.1       1.84      25.8
inherits                   0.50      7.0       1.10      15.4
data.frame                 0.48      6.7       4.86      68.3
unique.default             0.44      6.2       0.48       6.7
deparse                    0.36      5.1       1.18      16.6
rbind                      0.30      4.2       2.22      31.2
match                      0.28      3.9       1.38      19.4
[<-.factor                 0.28      3.9       0.56       7.9
levels                     0.26      3.7       0.34       4.8
NextMethod                 0.22      3.1       0.82      11.5
...

$by.total
                      total.time total.pct self.time self.pct
data.frame                  4.86      68.3      0.48      6.7
rbind                       2.22      31.2      0.30      4.2
do.call                     2.22      31.2      0.00      0.0
[                           1.98      27.8      0.16      2.2
[.data.frame                1.84      25.8      0.72     10.1
match                       1.38      19.4      0.28      3.9
%in%                        1.26      17.7      0.14      2.0
is.factor                   1.20      16.9      0.10      1.4
deparse                     1.18      16.6      0.36      5.1
...

To be honest, from this output I can't tell where my bottlenecks are, because (a) I use data.frame pretty often and (b) I never call deparse, for example. Furthermore, what is [?

So I tried Hadley Wickham's profr, but judging from the following graph it was not of any more use:

Is there a more convenient way to see which line numbers and particular function calls are slow?
Or, is there some literature that I should consult?

Any hints are greatly appreciated.

EDIT 1:
Based on Hadley's comment I will paste the code of my script below, together with the base graphics version of the plot. But note that my question is not related to this specific script. It is just a random script that I recently wrote. I am looking for a general way to find bottlenecks and speed up R code.

The data (x) looks like this:

type      word    response    N   Classification  classN
Abstract  ANGER   bitter      1   3a              3a
Abstract  ANGER   control     1   1a              1a
Abstract  ANGER   father      1   3a              3a
Abstract  ANGER   flushed     1   3a              3a
Abstract  ANGER   fury        1   1c              1c
Abstract  ANGER   hat         1   3a              3a
Abstract  ANGER   help        1   3a              3a
Abstract  ANGER   mad         13  3a              3a
Abstract  ANGER   management  2   1a              1a
... until row 1700

The script (with short explanations) is this:

Rprof("profile1.out")

# A new dataset is produced with each line of x contained x$N times 
y <- vector('list',length(x[,1]))
for (i in 1:length(x[,1])) {
  y[[i]] <- data.frame(rep(x[i,1],x[i,"N"]),rep(x[i,2],x[i,"N"]),rep(x[i,3],x[i,"N"]),rep(x[i,4],x[i,"N"]),rep(x[i,5],x[i,"N"]),rep(x[i,6],x[i,"N"]))
}
all <- do.call('rbind',y)
colnames(all) <- colnames(x)

# create a dataframe out of a word x class table
table_all <- table(all$word,all$classN)
dataf.all <- as.data.frame(table_all[,1:length(table_all[1,])])
dataf.all$words <- as.factor(rownames(dataf.all))
dataf.all$type <- "no"
# get type of the word.
words <- levels(dataf.all$words)
for (i in 1:length(words)) {
  dataf.all$type[i] <- as.character(all[pmatch(words[i],all$word),"type"])
}
dataf.all$type <- as.factor(dataf.all$type)
dataf.all$typeN <- as.numeric(dataf.all$type)

# aggregate response categories
dataf.all$c1 <- apply(dataf.all[,c("1a","1b","1c","1d","1e","1f")],1,sum)
dataf.all$c2 <- apply(dataf.all[,c("2a","2b","2c")],1,sum)
dataf.all$c3 <- apply(dataf.all[,c("3a","3b")],1,sum)

Rprof(NULL)

library(profr)
ggplot.profr(parse_rprof("profile1.out"))

The final data looks like this:

1a    1b  1c  1d  1e  1f  2a  2b  2c  3a  3b  pa  words   type    typeN   c1  c2  c3  pa
3 0   8   0   0   0   0   0   0   24  0   0   ANGER   Abstract    1   11  0   24  0
6 0   4   0   1   0   0   11  0   13  0   0   ANXIETY Abstract    1   11  11  13  0
2 11  1   0   0   0   0   4   0   17  0   0   ATTITUDE    Abstract    1   14  4   17  0
9 18  0   0   0   0   0   0   0   0   8   0   BARREL  Concrete    2   27  0   8   0
0 1   18  0   0   0   0   4   0   12  0   0   BELIEF  Abstract    1   19  4   12  0

The base graphics plot:

Running the script today also changed the ggplot2 graph a little (basically only the labels), see here.

Recommended answer

Alert readers of yesterday's breaking news (R 3.0.0 is finally out) may have noticed something interesting that is directly relevant to this question:

  • Profiling via Rprof() now optionally records information at the statement level, not just the function level.

And indeed, this new feature answers my question, and I will show how.


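Let's say we want to check whether vectorizing and pre-allocating are really better than good old for loops and incremental building of data when calculating a summary statistic such as the mean. The (relatively stupid) code is the following: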

# create big data frame:
n <- 1000
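x <- data.frame(group = sample(letters[1:4], n, replace=TRUE), condition = sample(LETTERS[1:10], n, replace = TRUE), data = rnorm(n), stringsAsFactors = TRUE)  # factors are needed for levels() below (no longer the default since R 4.0.0)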

# reasonable operations:
marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean)

# unreasonable operations:
marginal.means.2 <- marginal.means.1[NULL,]

row.counter <- 1
for (condition in levels(x$condition)) {
  for (group in levels(x$group)) {  
    tmp.value <- 0
    tmp.length <- 0
    for (c in 1:nrow(x)) {
      if ((x[c,"group"] == group) & (x[c,"condition"] == condition)) {
        tmp.value <- tmp.value + x[c,"data"]
        tmp.length <- tmp.length + 1
      }
    }
    marginal.means.2[row.counter,"group"] <- group 
    marginal.means.2[row.counter,"condition"] <- condition
    marginal.means.2[row.counter,"data"] <- tmp.value / tmp.length
    row.counter <- row.counter + 1
  }
}

# does it produce the same results?
all.equal(marginal.means.1, marginal.means.2)

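To use this code with Rprof, we need to parse it. That is, it needs to be saved in a file and then called from there. Hence, I uploaded it to pastebin, but it works exactly the same with local files.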

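Now, we

  • simply create a profile file and indicate that we want to save the line numbers,
  • source the code with the incredible combination eval(parse(..., keep.source = TRUE)) (the infamous fortune(106) does not seem to apply here, since I haven't found another way),
  • stop the profiling and indicate that we want the output based on the line numbers.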

The code is:

Rprof("profile1.out", line.profiling=TRUE)
eval(parse(file = "http://pastebin.com/download.php?i=KjdkSVZq", keep.source=TRUE))
Rprof(NULL)

summaryRprof("profile1.out", lines = "show")

which gives:

$by.self
                           self.time self.pct total.time total.pct
download.php?i=KjdkSVZq#17      8.04    64.11       8.04     64.11
<no location>                   4.38    34.93       4.38     34.93
download.php?i=KjdkSVZq#16      0.06     0.48       0.06      0.48
download.php?i=KjdkSVZq#18      0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#23      0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#6       0.02     0.16       0.02      0.16

$by.total
                           total.time total.pct self.time self.pct
download.php?i=KjdkSVZq#17       8.04     64.11      8.04    64.11
<no location>                    4.38     34.93      4.38    34.93
download.php?i=KjdkSVZq#16       0.06      0.48      0.06     0.48
download.php?i=KjdkSVZq#18       0.02      0.16      0.02     0.16
download.php?i=KjdkSVZq#23       0.02      0.16      0.02     0.16
download.php?i=KjdkSVZq#6        0.02      0.16      0.02     0.16

$by.line
                           self.time self.pct total.time total.pct
<no location>                   4.38    34.93       4.38     34.93
download.php?i=KjdkSVZq#6       0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#16      0.06     0.48       0.06      0.48
download.php?i=KjdkSVZq#17      8.04    64.11       8.04     64.11
download.php?i=KjdkSVZq#18      0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#23      0.02     0.16       0.02      0.16

$sample.interval
[1] 0.02

$sampling.time
[1] 12.54

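Checking the source code tells us that the problematic line (#17) is indeed the stupid if statement in the for loop. By comparison, calculating the same values with vectorized code (line #6) takes basically no time.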

I haven't tried it with any graphical output yet, but I am already very impressed by what I got so far.
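
For a local file, the same three steps can probably be written with source(), which also accepts keep.source. The following is only a minimal sketch under that assumption: the script contents, the temporary file names, and the variable names (script, prof, res) are hypothetical stand-ins, not the code from the pastebin link.

```r
# Hypothetical stand-in script, written to a temporary file so the
# example is self-contained. Any local .R file would do instead.
script <- tempfile(fileext = ".R")
writeLines(c(
  "n <- 50000",
  "x <- data.frame(g = sample(letters[1:4], n, replace = TRUE), v = rnorm(n))",
  "total <- 0",
  "for (i in seq_len(nrow(x))) {",
  "  if (x[i, \"g\"] == \"a\") total <- total + x[i, \"v\"]",
  "}"
), script)

prof <- tempfile(fileext = ".out")

Rprof(prof, line.profiling = TRUE)   # 1. start profiling and record line numbers
source(script, keep.source = TRUE)   # 2. run the file, keeping source references
Rprof(NULL)                          # 3. stop profiling

# Summarise by line; rows are labelled "<file name>#<line number>".
# Note: a script that finishes faster than the sampling interval
# may record no samples at all.
res <- summaryRprof(prof, lines = "show")
res$by.line
```

In this stand-in script, the row for line 5 (the if statement that indexes the data frame inside the loop) would be expected to dominate, mirroring the result above, although the exact timings depend on the machine.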
