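I've been using stringr because it's supposed to be faster, but today I found that it is much slower when working with factors. I haven't seen any warning that this would be the case, nor any explanation of why it would be.
For example: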
library(stringr)
library(forcats)  # provides as_factor()
string_options = c("OneWord", "TwoWords", "ThreeWords")
sample_chars = sample(string_options, 1e6, replace = TRUE)
sample_facts = as_factor(sample_chars)
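As expected, base R is slower than stringr when working with character vectors. But when working with factor vectors, base R is roughly 25 times faster (a median of 3.65ms versus 91.29ms in the benchmark below).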
bench::mark(
  base_chars    = grepl("Two", sample_chars),
  stringr_chars = str_detect(sample_chars, "Two"),
  base_facts    = grepl("Two", sample_facts),
  stringr_facts = str_detect(sample_facts, "Two")
)
# A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#1 base_chars 116.1ms 116.38ms 8.58 3.81MB 0 5 0 583ms <lgl [1,000,000]> <Rprofmem [1 × 3]> <bench_tm [5]> <tibble>
#2 stringr_chars 86.04ms 88.2ms 11.3 3.81MB 0 6 0 532ms <lgl [1,000,000]> <Rprofmem [2 × 3]> <bench_tm [6]> <tibble>
#3 base_facts 3.59ms 3.65ms 271. 11.44MB 0 136 0 501ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [136]> <tibble>
#4 stringr_facts 90.71ms 91.29ms 10.9 11.44MB 0 6 0 549ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [6]> <tibble>
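It looks like stringr does nothing different for factors, whereas base R optimizes for them significantly. Is this expected behavior? Should I report it as a stringr issue? Am I missing a stringr setting entirely? I don't want to have to think about the format of my data to decide whether to use stringr or base R.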
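For illustration, here is a minimal sketch of how the gap might be closed by hand. It assumes the base R speedup comes from matching each distinct level once and then expanding the result through the factor's integer codes, which I have not confirmed. The helper name detect_levels is hypothetical and is not part of stringr.

```r
library(stringr)

# Hypothetical helper (not part of stringr): run str_detect() once per
# distinct level, then expand the per-level result back to the full
# vector by indexing with the factor's integer codes.
detect_levels <- function(x, pattern) {
  if (!is.factor(x)) {
    # Non-factor input: fall back to the ordinary stringr call.
    return(str_detect(x, pattern))
  }
  level_hits <- str_detect(levels(x), pattern)
  # Missing values have an NA code, so they index to NA,
  # which is also what str_detect() returns for NA input.
  level_hits[as.integer(x)]
}

x <- factor(c("OneWord", "TwoWords", NA, "TwoWords"))
detect_levels(x, "Two")
```

With only three levels in a million-element factor, this approach does three pattern matches plus one integer index, so it should behave much more like the base_facts row than the stringr_facts row, though I have not benchmarked it here.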