使用 dplyr 窗口函数计算百分位数

发布于05月28日

我有一个可行的解决方案，但我正在寻找一个更干净、更可读的解决方案，它可能会利用一些较新的dplyr窗口函数.

使用mtcars数据集，如果我想查看第25、50、75百分位，以及每加仑英里数("mpg")的平均值和计数(按气缸数("cyl")，我使用以下代码:

library(dplyr)
library(tidyr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25,.5,.75)

# old dplyr solution 
mtcars %>% group_by(cyl) %>% 
  do(data.frame(p=p, stats=quantile(.$mpg, probs=p), 
                n = length(.$mpg), avg = mean(.$mpg))) %>%
  spread(p, stats) %>%
  select(1, 4:6, 3, 2)

# note: the select and spread statements are just to get the data into
#       the format in which I'd like to see it, but are not critical

有没有一种方法可以让dplyr使用一些摘要函数(n_tiles、percent_rank等)更清晰地实现这一点？我所说的干净，是指没有"做"的陈述.

非常感谢.

推荐答案

在dplyr 1.0中，summarise可以返回多个值，允许以下操作:

library(tidyverse)

mtcars %>% 
  group_by(cyl) %>%  
  summarise(quantile = scales::percent(c(0.25, 0.5, 0.75)),
            mpg = quantile(mpg, c(0.25, 0.5, 0.75)))

或者，您可以通过使用enframe避免使用单独的行来命名分位数:

mtcars %>% 
  group_by(cyl) %>%  
  summarise(enframe(quantile(mpg, c(0.25, 0.5, 0.75)), "quantile", "mpg"))

    cyl quantile   mpg
  <dbl> <chr>    <dbl>
1     4 25%       22.8
2     4 50%       26  
3     4 75%       30.4
4     6 25%       18.6
5     6 50%       19.7
6     6 75%       21  
7     8 25%       14.4
8     8 50%       15.2
9     8 75%       16.2

Answer for previous versions of 100

library(tidyverse)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(x=list(enframe(quantile(mpg, probs=c(0.25,0.5,0.75)), "quantiles", "mpg"))) %>% 
  unnest(x)

    cyl quantiles   mpg
1     4       25% 22.80
2     4       50% 26.00
3     4       75% 30.40
4     6       25% 18.65
5     6       50% 19.70
6     6       75% 21.00
7     8       25% 14.40
8     8       50% 15.20
9     8       75% 16.25

可以使用tidyeval将其转换为更通用的函数:

q_by_group = function(data, value.col, ..., probs=seq(0,1,0.25)) {

  groups=enquos(...)
  
  data %>% 
    group_by(!!!groups) %>% 
    summarise(x = list(enframe(quantile({{value.col}}, probs=probs), "quantiles", "mpg"))) %>% 
    unnest(x)
}

q_by_group(mtcars, mpg)
q_by_group(mtcars, mpg, cyl)
q_by_group(mtcars, mpg, cyl, vs, probs=c(0.5,0.75))
q_by_group(iris, Petal.Width, Species)