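Testing the code on Godbolt's Compiler Explorer provides this explanation:
- at `-O0`, or without optimizations, the generated code calls the C library function `strlen`;
- at `-O1`, the generated code uses a simple inline expansion based on a `rep scasb` instruction;
- at `-O2` and above, the generated code uses a more elaborate inline expansion.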
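To reproduce this comparison, a minimal wrapper along the following lines can be pasted into Compiler Explorer and compiled at each optimization level. The name `string_length` is made up for illustration and is not from the original question. With gcc, adding `-fno-builtin-strlen` should suppress the inline expansion and leave a plain call to the library function, which makes it possible to compare both variants at the same optimization level.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper (not from the original question): it gives the
   compiler a single strlen call site that it can either expand inline
   or leave as a call to the C library. */
size_t string_length(const char *s) {
    return strlen(s);
}
```

Compiling this with `-O0`, `-O1`, and `-O2` in turn and reading the generated assembly is enough to see which of the three strategies the compiler picked.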
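Benchmarking the code repeatedly shows substantial variation from one run to the next, but increasing the number of iterations shows that:
- the `-O1` code is much slower than the C library implementation: 32240 vs. 3090;
- the `-O2` code is faster than the `-O1` code but still substantially slower than the C library code: 8570 vs. 3090.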
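Raising the iteration count is what was done here. A different and commonly used way to reduce run-to-run noise is to repeat the whole measurement several times and keep the fastest run. The sketch below illustrates that alternative under stated assumptions: `best_of`, `sample_workload`, and `buffer` are hypothetical names, and the workload is only a stand-in for the code being measured.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical workload: 100,000 strlen calls on strings of length 0..999.
   The buffer is static, hence zero-initialized, so it is always terminated. */
static char buffer[1001];

void sample_workload(void) {
    for (int n = 0; n < 100; n++) {
        for (int i = 0; i < 1000; i++) {
            buffer[i] = '\0';              /* terminate at length i */
            buffer[strlen(buffer)] = 'A';  /* store depends on the result */
        }
    }
}

/* Run fn() `runs` times and return the shortest CPU time in seconds.
   Returns -1.0 if runs is not positive. */
double best_of(int runs, void (*fn)(void)) {
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        fn();
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Calling `best_of(5, sample_workload)` returns the fastest of five runs. The minimum is used rather than the mean because interference from other processes can only make a run slower, never faster.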
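This behavior is specific to `gcc` and the GNU libc. The same test on OS/X with `clang` and Apple's Libc shows no significant difference, which is not surprising, as Godbolt shows that `clang` generates a call to the C library `strlen` at all optimization levels.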
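This could be considered a bug in gcc/glibc, but more extensive benchmarking might show that, for small strings, the overhead of calling `strlen` matters more than the slowness of the inline code. The strings in the benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.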
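I improved the benchmark and tested various string lengths. The benchmarks, run on Linux with gcc (Debian 4.7.2-5) 4.7.2 on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz, show that the inline code generated at `-O1` is always slower, by as much as a factor of 10 for moderately long strings, while `-O2` is only slightly faster than the libc `strlen` for very short strings and roughly two to three times slower for longer strings. From this data, the GNU C library version of `strlen` is quite efficient for most string lengths, at least on my specific hardware. Also keep in mind that caching has a major impact on benchmark measurements.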
Here is the updated code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Time strlen() on strings of every length in [minlen, maxlen). */
    void benchmark(int repeat, int minlen, int maxlen) {
        char *s = malloc(maxlen + 1);
        if (s == NULL)
            return;
        memset(s, 'A', minlen);
        long long bytes = 0, calls = 0;
        clock_t clk = clock();
        for (int n = 0; n < repeat; n++) {
            for (int i = minlen; i < maxlen; ++i) {
                bytes += i + 1;       /* string bytes plus the terminator */
                calls += 1;
                s[i] = '\0';          /* terminate the string at length i */
                s[strlen(s)] = 'A';   /* overwrite the terminator; depends on the result */
            }
        }
        clk = clock() - clk;
        free(s);
        double avglen = (minlen + maxlen - 1) / 2.0;
        double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
        printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/call\n",
               avglen, ns / bytes, ns / calls);
    }

    int main(void) {
        benchmark(10000000, 0, 1);
        benchmark(1000000, 0, 10);
        benchmark(1000000, 5, 15);
        benchmark(100000, 0, 100);
        benchmark(100000, 50, 150);
        benchmark(10000, 0, 1000);
        benchmark(10000, 500, 1500);
        benchmark(1000, 0, 10000);
        benchmark(1000, 5000, 15000);
        benchmark(100, 1000000 - 50, 1000000 + 50);
        return 0;
    }

Here is the output:

    chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
    average length       0 -> avg time:  14.000 ns/byte,  14.000 ns/call
    average length       4 -> avg time:   2.364 ns/byte,  13.000 ns/call
    average length      10 -> avg time:   1.238 ns/byte,  13.000 ns/call
    average length      50 -> avg time:   0.317 ns/byte,  16.000 ns/call
    average length     100 -> avg time:   0.169 ns/byte,  17.000 ns/call
    average length     500 -> avg time:   0.074 ns/byte,  37.000 ns/call
    average length    1000 -> avg time:   0.068 ns/byte,  68.000 ns/call
    average length    5000 -> avg time:   0.064 ns/byte, 318.000 ns/call
    average length   10000 -> avg time:   0.062 ns/byte, 622.000 ns/call
    average length 1000000 -> avg time:   0.062 ns/byte, 62000.000 ns/call

    chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
    average length       0 -> avg time:  20.000 ns/byte,  20.000 ns/call
    average length       4 -> avg time:   3.818 ns/byte,  21.000 ns/call
    average length      10 -> avg time:   2.190 ns/byte,  23.000 ns/call
    average length      50 -> avg time:   0.990 ns/byte,  50.000 ns/call
    average length     100 -> avg time:   0.816 ns/byte,  82.000 ns/call
    average length     500 -> avg time:   0.679 ns/byte, 340.000 ns/call
    average length    1000 -> avg time:   0.664 ns/byte, 664.000 ns/call
    average length    5000 -> avg time:   0.651 ns/byte, 3254.000 ns/call
    average length   10000 -> avg time:   0.649 ns/byte, 6491.000 ns/call
    average length 1000000 -> avg time:   0.648 ns/byte, 648000.000 ns/call

    chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
    average length       0 -> avg time:  10.000 ns/byte,  10.000 ns/call
    average length       4 -> avg time:   2.000 ns/byte,  11.000 ns/call
    average length      10 -> avg time:   1.048 ns/byte,  11.000 ns/call
    average length      50 -> avg time:   0.337 ns/byte,  17.000 ns/call
    average length     100 -> avg time:   0.299 ns/byte,  30.000 ns/call
    average length     500 -> avg time:   0.202 ns/byte, 101.000 ns/call
    average length    1000 -> avg time:   0.188 ns/byte, 188.000 ns/call
    average length    5000 -> avg time:   0.174 ns/byte, 868.000 ns/call
    average length   10000 -> avg time:   0.172 ns/byte, 1716.000 ns/call
    average length 1000000 -> avg time:   0.172 ns/byte, 172000.000 ns/call
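
The relative speeds can be read off the ns/call columns directly. As a small worked example, the sketch below divides the inline-code timings by the libc timings for a few of the rows above. The `-O0` run stands in for libc, since at `-O0` the generated code calls the library `strlen`. The names `sample`, `slowdown`, and `print_slowdowns` are hypothetical. At an average length of 1000, for instance, 664 / 68 is a little under 10 and 188 / 68 is a little under 3.

```c
#include <stdio.h>

/* ns/call figures copied from the three runs above; the libc column
   comes from the -O0 run, which calls the library strlen. */
struct sample {
    long avglen;
    double libc_ns, o1_ns, o2_ns;
};

static const struct sample samples[] = {
    {4, 13.0, 21.0, 11.0},
    {100, 17.0, 82.0, 30.0},
    {1000, 68.0, 664.0, 188.0},
    {1000000, 62000.0, 648000.0, 172000.0},
};

/* How many times slower the inline code is than libc;
   a value below 1.0 means the inline code is faster. */
double slowdown(double inline_ns, double libc_ns) {
    return inline_ns / libc_ns;
}

/* Print the -O1 and -O2 slowdown factors for each sampled length. */
void print_slowdowns(void) {
    size_t count = sizeof samples / sizeof samples[0];
    for (size_t i = 0; i < count; i++) {
        printf("length %7ld: -O1 %.1fx, -O2 %.1fx\n",
               samples[i].avglen,
               slowdown(samples[i].o1_ns, samples[i].libc_ns),
               slowdown(samples[i].o2_ns, samples[i].libc_ns));
    }
}
```

Calling `print_slowdowns()` from a `main` function prints one line per sampled length. These ratios describe only this machine and compiler version.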