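If speed becomes a problem with a large number of rows, the fastest way to handle the iterative calculation is probably Rcpp.
Essentially, you need a cumulative sum function that adds each day's supply-minus-demand result to the running total and resets the total to zero if it goes negative. Here is a trial function, cumnominus, which gives the correct table and can be used inside dplyr: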
library(dplyr)
cumnominus <- Rcpp::cppFunction("NumericVector cumnominus(NumericVector x) {
  int n = x.size();
  NumericVector sumout(n);
  if (n == 0) return sumout;  // nothing to accumulate
  // Clamp the running total (not the daily net change) at zero
  sumout[0] = (x[0] < 0) ? 0 : x[0];
  for (int i = 1; i < n; i++) {
    double total = sumout[i - 1] + x[i];
    sumout[i] = (total < 0) ? 0 : total;
  }
  return sumout;
}")
Inventory |>
  mutate(In = Supply,
         End = cumnominus(Supply - Demand),
         Start = lag(End, default = 0),
         Short = pmax(0, Demand - (Start + Supply)),
         Out = pmin(Demand, Start + In)) |>
  select(Age, Demand, Supply, Start, In, Out, End, Short)
#>   Age Demand Supply Start   In  Out  End Short
#> 1  90      0   4000     0 4000    0 4000     0
#> 2 120   5000     50  4000   50 4050    0   950
#> 3 270     25   4000     0 4000   25 3975     0
#> 4 365   5000    300  3975  300 4275    0   725
#> 5 Inf     10      0     0    0    0    0    10
As a quick test against an R-only loop on a 5M-row data frame, the Rcpp version takes about 0.05 seconds, compared with 8.5 s for the R loop:
# R-only equivalent of cumnominus, used for the comparison
cumnominus_r <- function(x) {
  out_sum <- integer(length(x))
  out_sum[1] <- max(0, x[1])
  for (i in 2:length(x)) {
    # Clamp the running total (not the daily net change) at zero
    total <- out_sum[i - 1] + x[i]
    out_sum[i] <- ifelse(total < 0, 0, total)
  }
  out_sum
}
big_df <- tibble(
  Demand = sample(seq(1000, 6000, 500), 5000000, replace = TRUE),
  Supply = sample(seq(1000, 6000, 500), 5000000, replace = TRUE)
)
bench::mark(
  Rcpp_fun = big_df |>
    mutate(End = cumnominus(Supply - Demand)),
  R_only_fun = big_df |>
    mutate(End = cumnominus_r(Supply - Demand))
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 Rcpp_fun    43.15ms  52.95ms    16.1      77.7MB     8.94
#> 2 R_only_fun    8.59s    8.59s     0.116    95.4MB    20.8
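If you want to check the recurrence outside R, a minimal standalone C++ sketch might look like the one below. It assumes a plain std::vector in place of Rcpp's NumericVector, and clamped_cumsum is a hypothetical name used only for illustration; it is not part of Rcpp or of the code above. It implements the rule described at the top: add each net change (supply minus demand) to the running total, and reset the total to zero whenever it would go negative.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper for illustration only (assumed name, not from Rcpp):
// running total of net changes, reset to zero whenever it would drop below zero.
std::vector<double> clamped_cumsum(const std::vector<double>& net) {
    std::vector<double> out;
    out.reserve(net.size());
    double total = 0.0;
    for (double change : net) {
        // Clamp the running total, not the individual change
        total = std::max(0.0, total + change);
        out.push_back(total);
    }
    return out;
}
```

Fed the Supply - Demand values from the example table (4000, -4950, 3975, -4700, -10), this returns 4000, 0, 3975, 0, 0, which matches the End column.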