Rust: should the cache padding size on x86-64 be 128 bytes?

Posted on May 5

I found this comment in crossbeam:

Starting from Intel's Sandy Bridge, the spatial prefetcher now pulls pairs of 64-byte cache lines at a time, so we have to align to 128 bytes rather than 64.

Sources:

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

https://github.com/facebook/folly/blob/1b5288e6eea6df074758f877c849b6e73bbb9fbb/folly/lang/Align.h#L107
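For readers who want to see what "align to 128 bytes" means in practice, here is a minimal C++17 sketch. The names `kPadding`, `Padded`, and `padded_counters` are hypothetical and used only for illustration; this is not crossbeam's or folly's actual code, and 128 is simply the value the comment above argues for.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 128 bytes, the padding the crossbeam comment argues for on x86-64.
inline constexpr std::size_t kPadding = 128;

// Hypothetical wrapper. alignas raises the alignment to kPadding, and because
// sizeof is always a multiple of the alignment, the size is rounded up too.
// Neighbouring array elements therefore never share a 128-byte block.
template <typename T>
struct alignas(kPadding) Padded {
    T value{};
};

static_assert(alignof(Padded<std::atomic<int>>) == kPadding);
static_assert(sizeof(Padded<std::atomic<int>>) == kPadding);

// One counter per thread, each in its own aligned 128-byte block.
inline Padded<std::atomic<int>> padded_counters[4];
```

Switching `kPadding` to 64 gives the one-cache-line variant, which is the alternative this question is about.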

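I did not find a statement like that in Intel's manual. But as of its latest commit, folly still uses 128-byte padding, which I find convincing. So I started writing code to see whether I could observe this behavior. Here is my code.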

#include <thread>

int counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

int main() {
    std::thread t1(update, 0);
    std::thread t2(update, 1);
    std::thread t3(update, 2);
    std::thread t4(update, 3);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}


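My CPU is an RY00X. When the indices are 0, 1, 2, 3, it takes about 1.2 seconds to finish. When the indices are 0, 16, 32, 48, it takes about 200 ms. When the indices are 0, 32, 64, 96, it also takes about 200 ms, exactly the same as before. I also tested this on an Intel machine and got similar results.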
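To make the three index sets concrete, the following sketch works out which 64-byte cache line and which aligned 128-byte pair of lines each index falls into. It assumes a 4-byte `int`, 64-byte cache lines, and an array that starts on a 128-byte boundary (the array in the code above is not guaranteed to be aligned that way). The helper names `line_of` and `pair_of` are hypothetical.

```cpp
#include <cstddef>

// Assumptions: 4-byte int, 64-byte cache lines, and counter[0] placed on a
// 128-byte boundary.
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kPairBytes = 128;

// Index of the 64-byte cache line that holds counter[idx].
constexpr std::size_t line_of(std::size_t idx) {
    return idx * sizeof(int) / kLineBytes;
}

// Index of the aligned 128-byte pair of lines that holds counter[idx].
constexpr std::size_t pair_of(std::size_t idx) {
    return idx * sizeof(int) / kPairBytes;
}

// Indices 0..3 share one cache line: true false sharing.
static_assert(line_of(0) == line_of(3));
// Indices 0 and 16 sit on adjacent lines inside the same 128-byte pair.
static_assert(line_of(16) == 1 && pair_of(16) == pair_of(0));
// Indices 0 and 32 sit in different 128-byte pairs.
static_assert(pair_of(32) == 1);
```

So under these assumptions the 0, 16, 32, 48 run tests separate cache lines that still share 128-byte pairs, while the 0, 32, 64, 96 run gives every thread its own pair.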

From this micro-benchmark, I can't see any reason to use 128-byte padding instead of 64-byte padding. Did I do something wrong?

Recommended answer

Performance counters reveal a difference, even with your benchmark

Adjacent lines: 500 +- 300 machine clears

$ g++ -DSIZE=64 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out

 Performance counter stats for './a.out' (25 runs):

            560.22 msec task-clock                      #    3.958 CPUs utilized    ( +-  0.12% )
                 0      context-switches                #    0.000 /sec
                 0      cpu-migrations                  #    0.000 /sec
               126      page-faults                     #  224.752 /sec             ( +-  0.35% )
     2,180,391,747      cycles                          #    3.889 GHz              ( +-  0.12% )
     2,003,039,378      instructions                    #    0.92  insn per cycle   ( +-  0.00% )
     1,604,118,661      uops_issued.any                 #    2.861 G/sec            ( +-  0.00% )
     2,003,739,959      uops_executed.thread            #    3.574 G/sec            ( +-  0.00% )
               494      machine_clears.memory_ordering  #  881.172 /sec             ( +-  9.00% )

          0.141534 +- 0.000342 seconds time elapsed  ( +-  0.24% )

vs. with 128-byte separation: only a very few machine clears

$ g++ -DSIZE=128 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out

 Performance counter stats for './a.out' (25 runs):

            560.01 msec task-clock                      #    3.957 CPUs utilized    ( +-  0.13% )
                 0      context-switches                #    0.000 /sec
                 0      cpu-migrations                  #    0.000 /sec
               124      page-faults                     #  221.203 /sec             ( +-  0.16% )
     2,180,048,243      cycles                          #    3.889 GHz              ( +-  0.13% )
     2,003,038,553      instructions                    #    0.92  insn per cycle   ( +-  0.00% )
     1,604,084,990      uops_issued.any                 #    2.862 G/sec            ( +-  0.00% )
     2,003,707,895      uops_executed.thread            #    3.574 G/sec            ( +-  0.00% )
                22      machine_clears.memory_ordering  #   39.246 /sec             ( +-  9.68% )

          0.141506 +- 0.000342 seconds time elapsed  ( +-  0.24% )

vs. actual false sharing within one line: about 10 million machine clears

$ g++ -DSIZE=4 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out

 Performance counter stats for './a.out' (25 runs):

            809.98 msec task-clock                      #    3.835 CPUs utilized    ( +-  0.42% )
                 0      context-switches                #    0.000 /sec
                 0      cpu-migrations                  #    0.000 /sec
               122      page-faults                     #  152.953 /sec             ( +-  0.22% )
     3,152,973,230      cycles                          #    3.953 GHz              ( +-  0.42% )
     2,003,038,681      instructions                    #    0.65  insn per cycle   ( +-  0.00% )
     2,868,628,070      uops_issued.any                 #    3.596 G/sec            ( +-  0.41% )
     2,934,059,729      uops_executed.thread            #    3.678 G/sec            ( +-  0.30% )
        10,810,169      machine_clears.memory_ordering  #   13.553 M/sec            ( +-  0.90% )

           0.21123 +- 0.00124 seconds time elapsed  ( +-  0.59% )

Improved benchmark: align the array (alignas(128)) and make it volatile so that optimization can be enabled

#include <thread>

alignas(128) volatile int counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

static const int stride = SIZE/sizeof(counter[0]);

int main() {
    std::thread t1(update, 0*stride);
    std::thread t2(update, 1*stride);
    std::thread t3(update, 2*stride);
    std::thread t4(update, 3*stride);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}
