R histogram: storing the computed bin values

Posted on January 5

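I have a plot with location on the x-axis and percentage on the y-axis. The data frame is large (over 2 million rows), so I group the points into 10k bins and plot the mean percentage of each bin. The code I use is: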

ggplot(data, aes(norm_location, percentage, colour = class)) +
  stat_summary_bin(fun = "mean", 
                   geom="point", 
                   bins = 10000)

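Because of the size of the data, this code takes a long time to run every time I need to change something about the plot (title, axis names, colours, etc.). Is there a way to store the bin values in a smaller data frame with 10k rows and use that instead of the full data? How can I generate this plot more efficiently?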

Thanks!

Recommended answer

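If I understand correctly, you want to summarize the data before plotting to make it faster. That sounds quite reasonable to me. In addition, if you are happy to use data.table, you can benefit from its multithreading.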
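As a small sketch of the multithreading point: data.table exposes its thread count through `getDTthreads()` and `setDTthreads()`. Which operations actually run in parallel depends on the data.table version and on the operation, so treat any speed-up as something to measure rather than assume.

```r
library(data.table)

# Report how many threads data.table is currently allowed to use
getDTthreads()

# Allow all logical CPUs (0 means "all"); the previous setting is returned invisibly
old_threads <- setDTthreads(0)

# ... run the aggregation here ...

# Restore the previous setting afterwards
setDTthreads(old_threads)
```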

# Sample data
library(data.table)
library(ggplot2)
library(magrittr)  # provides %>%
data <- data.table(norm_location = 1:2000000, percentage = runif(2000000, min = 0, max = 1),
                   class = sample(c('apple', 'pear', 'orange'), 2000000, replace = TRUE))

# Manually slice the x range into 10000 bins of width binwidth, then plot one mean point per bin and class
data[, binwidth := (max(norm_location) - min(norm_location)) / 10000, by = class][,
     .(percentage = mean(percentage)),
     by = .((norm_location - min(norm_location)) %/% binwidth, class, binwidth)][,
     # the grouping column holds the bin index; multiply by binwidth to get back to x units
     norm_location := norm_location * binwidth, by = class] %>%
  ggplot(aes(norm_location, percentage, colour = class)) +
  geom_point()
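
The question also asks how to keep the binned values around so that cosmetic changes do not trigger a re-aggregation. A minimal sketch of that idea follows, under some assumptions: the names `n_bins`, `bin`, `binned` and `p` are illustrative, the sample is reduced to 200,000 rows so it runs quickly, a single bin width is used for all classes, and points are placed at bin centres. The binning may therefore differ slightly from what `stat_summary_bin()` computes.

```r
library(data.table)
library(ggplot2)

set.seed(1)   # assumption: fixed seed only to make the sketch reproducible
n <- 200000   # smaller than the 2 million rows in the question
data <- data.table(
  norm_location = 1:n,
  percentage    = runif(n),
  class         = sample(c("apple", "pear", "orange"), n, replace = TRUE)
)

n_bins   <- 10000
rng      <- range(data$norm_location)
binwidth <- diff(rng) / n_bins

# Bin index in 0..(n_bins - 1); pmin() folds the maximum value into the last bin.
# Note that := adds the column to `data` by reference.
data[, bin := pmin((norm_location - rng[1]) %/% binwidth, n_bins - 1)]

# The small summary table: one row per (bin, class), computed once
binned <- data[, .(percentage = mean(percentage), n = .N), by = .(bin, class)]
binned[, norm_location := rng[1] + (bin + 0.5) * binwidth]  # bin centre

# Plot from the small table; restyling no longer touches the big data
p <- ggplot(binned, aes(norm_location, percentage, colour = class)) +
  geom_point()
p + labs(title = "Mean percentage per bin", x = "Location", y = "Percentage")
```

Because `binned` has at most one row per bin and class, changing titles, axis names or colours only re-renders the small table. It can also be written to disk with `saveRDS()` and reloaded in a later session.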