The code below changes values in the $type column based on the values in the $weight column.
n <- 1e3; m <- n*10
Treshold <- 50
wts <-runif(m)
df <- data.frame(id=seq_len(m), weight=wts * 100, type='L')
library(microbenchmark)
microbenchmark(
"df-col-row" = (df$type[df$weight < Treshold] <- "M"),
"df-row-col" = (df[df$weight < Treshold, ]$type <- "M")
)
#
#Unit: microseconds
# expr min lq mean median uq max neval
# df-col-row 80.6 87.65 145.429 89.55 104.55 5109.1 100
# df-row-col 564.9 586.10 618.496 592.40 618.90 1601.0 100
Why is the first alternative faster than the second?
Update 1
As expected, the more columns are added, the larger the difference becomes.
d9 <- data.frame(type='L', weight=wts * 100, c3=3, c4=4, c5=5, c6=6, c7=7, c8=8, c9=9)
microbenchmark(
"df-row-9col" = (d9[d9$weight < Treshold, ]$type <- "M")
)
#Unit: microseconds
# expr min lq mean median uq max neval
# df-row-9col 950.1 1091.55 1267.982 1111.1 1172.45 5806 100
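To see how the row-first form scales with the number of columns, a rough sketch using only base R is given here. The helper name time_row_first and its parameters k, m and n_rep are made up for illustration, the threshold 50 is hard-coded, and stringsAsFactors = FALSE is set only so the assignment of "M" also works on R versions before 4.0. system.time is much coarser than microbenchmark, so the numbers only indicate a trend.

```r
# Hypothetical helper (not from the original post): time n_rep row-first
# assignments on a data frame with k extra constant columns.
time_row_first <- function(k, m = 1e4, n_rep = 50) {
  d <- data.frame(type = "L", weight = runif(m) * 100,
                  stringsAsFactors = FALSE)
  # Add k extra columns named c1, c2, ..., ck.
  for (j in seq_len(k)) d[[paste0("c", j)]] <- j
  # Elapsed seconds for n_rep repetitions of the row-first assignment.
  system.time(
    for (r in seq_len(n_rep)) d[d$weight < 50, ]$type <- "M"
  )[["elapsed"]]
}

# Elapsed time for 0, 4, 8 and 16 extra columns.
timings <- sapply(c(0, 4, 8, 16), time_row_first)
print(timings)
```

The timings depend on the machine and R version, so no expected output is shown.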
Update 2
In the first alternative, df is copied once; in the second, it is copied twice.
tracemem(df)
df$type[df$weight < Treshold] <- "M" # Alt 1.
#tracemem[0x000002c92d2b87c8 -> 0x000002c92d2b9498]: $<-.data.frame $<-
df[df$weight < Treshold, ]$type <- "M" # Alt 2.
#tracemem[0x000002c92d2b9498 -> 0x000002c92d2b9ad8]:
#tracemem[0x000002c92d2b9ad8 -> 0x000002c92d2c47d8]: [<-.data.frame [<-
untracemem(df)
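One way to read the tracemem output is through the way R rewrites nested replacement calls (see the "Subset assignment" section of the R Language Definition). The sketch below writes out roughly what each alternative expands to. The names d0, df1, df2, s1, s2 and idx are made up for illustration, and stringsAsFactors = FALSE is an assumption added so the code also runs on R versions before 4.0. It is a reading aid, not a statement about exactly where each copy is made.

```r
# Small example data (hypothetical, much smaller than in the question).
set.seed(1)
d0 <- data.frame(id = 1:10, weight = runif(10) * 100, type = "L",
                 stringsAsFactors = FALSE)
idx <- d0$weight < 50

# Alt 1: df$type[idx] <- "M" is roughly
#   modify the extracted column vector, then put it back with `$<-`.
df1 <- d0
df1 <- `$<-`(df1, "type", `[<-`(df1$type, idx, value = "M"))

# Alt 2: df[idx, ]$type <- "M" is roughly
#   extract a sub data frame of the selected rows (all columns),
#   modify its type column, then write the whole sub data frame back
#   with `[<-`.
df2 <- d0
df2 <- `[<-`(df2, idx, , value = `$<-`(df2[idx, ], "type", "M"))

# Both expansions give the same type column as the original forms.
s1 <- d0; s1$type[idx] <- "M"
s2 <- d0; s2[idx, ]$type <- "M"
stopifnot(identical(df1$type, s1$type),
          identical(df2$type, s2$type),
          identical(df1$type, df2$type))
```

In this reading, Alt 2 builds an intermediate data frame containing every column for the selected rows and then goes through `[<-.data.frame`, which would be consistent with the extra copy in the trace and with the cost growing as columns are added.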