Your loop doesn't get vectorized at all, because strict FP math is the default. XMM0 is a 128-bit register; YMM0 is the corresponding 256-bit register. vaddsd is ADD Scalar Double, using only the low element of XMM0. https://felixcloutier.com/x86/addsd
Use clang -O3 -ffast-math -march=native to get it to vectorize and unroll (by 4) for a 16x speedup: 4x each from AVX and from instruction-level parallelism (see Wikipedia / Modern Microprocessors: A 90-Minute Guide!), with the array small enough not to bottleneck on L3 cache bandwidth. (Another factor of about 2 is available for arrays that fit in L1d rather than just L2, e.g. with #pragma clang loop interleave_count(8) to unroll even more, in code you've already cache-blocked so it usually gets hits in L1d cache.)
Your Raptor Lake CPU has two fully-pipelined vector-FP add units with a pipeline length of 3 (cycles of latency before a result is ready as input for another add). This answer includes my results from an i7-6700k Skylake, which is the same except its FP-add pipeline has 4-cycle latency.
@Jérôme Richard comments that NumPy just does scalar pairwise summation for FP array sums, which gains some ILP over a purely naive sequential sum. That's fine if you're bottlenecked on DRAM bandwidth anyway; one benefit is numerical consistency across ISAs and available SIMD feature sets, achieved by not using them.
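As an aside, a minimal scalar pairwise-summation sketch looks like the following; the recursion threshold and function name are made up for illustration, not NumPy's actual implementation:

#include <stddef.h>

// Pairwise summation: split the array in half recursively so rounding error
// grows roughly as O(log n), and the two halves' sums are independent
// dependency chains (some ILP even without SIMD).
static double pairwise_sum(const double *a, size_t n)
{
    if (n <= 8) {                       // arbitrary base-case threshold
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }
    size_t half = n / 2;
    return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}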
You're looking for vaddpd ymm0, ymm0, [rdi] (packed double on a 256-bit vector). GCC will do that with -ffast-math, which lets it pretend FP math is associative, changing the rounding error. (For the better in this case, e.g. if you compare against a long double sum or a Kahan error-compensated sum; it's a step in the same direction as pairwise summation.) See also https://gcc.gnu.org/wiki/FloatingPointMath
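For reference, a Kahan (compensated) summation sketch in plain C; note that the compensation only survives under strict FP semantics, so don't build this part with -ffast-math:

#include <stddef.h>

// Kahan summation: carry the rounding error of each add in 'c' and feed it
// back in, giving much lower error than a naive running sum.
double kahan_sum(const double *a, size_t n)
{
    double sum = 0.0, c = 0.0;          // c = running compensation
    for (size_t i = 0; i < n; i++) {
        double y = a[i] - c;            // corrected next term
        double t = sum + y;             // low bits of y are lost here...
        c = (t - sum) - y;              // ...and recovered into c
        sum = t;
    }
    return sum;
}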
gcc -O3 -march=native -ffast-math foo.c
That gives about a 4x speedup, since FP ALU latency (1 vector instead of 1 scalar per 3 cycles on your CPU) is still a worse bottleneck than L3 bandwidth, and definitely worse than L2 cache bandwidth.
SIZE=4000000 times sizeof(double) is 30.52 MiB, so it fits in your high-end Raptor Lake's 36 MiB L3 cache. But to go much faster you'll want to reduce SIZE and increase REPS (and perhaps put the repeat loop inside a timed region). The whole program is so short under perf stat on my i7-6700k Skylake with DDR4-2666 that most of its time is startup. That's also rather short to time with clock() instead of clock_gettime.
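If you do want finer-grained wall-clock timing than clock(), a typical clock_gettime pattern looks like this (POSIX; the placeholder work and variable names are just for illustration):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++) x += 1.0;   // placeholder work to time
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.9f s\n", dt);
    return 0;
}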
Your per-core cache sizes are 48 KiB L1d and 2 MiB L2 (on a Golden Cove P-core; the Gracemont E-cores differ). https://en.wikipedia.org/wiki/Raptor_Lake / https://chipsandcheese.com/2022/08/23/a-preview-of-raptor-lakes-improved-l2-caches/. SIZE=6144 would make the array exactly fill L1d; SIZE=5120 if we aim for only 40 KiB for the array, leaving room for other stuff. It's also best to align it by 32 bytes with aligned_alloc, so we can read it from L1d cache at 3 vectors (96 bytes) per clock cycle instead of having a cache-line split on every other vector. (https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/ / https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-throughput-limit / https://uops.info/)
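A minimal allocation sketch along those lines (the SIZE value and error handling here are just illustrative):

#include <stdlib.h>

#define SIZE 5120            // 5120 * sizeof(double) = 40 KiB of the 48 KiB L1d

int main(void) {
    // 32-byte alignment so 256-bit (32-byte) loads never split a cache line.
    // aligned_alloc wants a size that's a multiple of the alignment; 40960 is.
    double *array = aligned_alloc(32, SIZE * sizeof(double));
    if (!array) return 1;
    // ... fill and sum the array ...
    free(array);
    return 0;
}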
To get anywhere near peak FLOPS (within a factor of 2 since we're not using FMA), we need to run 2 vaddpd
instructions every clock cycle. But it has a latency of 3 cycles on your Golden Cove P-cores (Alder/Raptor Lake), so the latency * bandwidth
product is 6 vaddpd
in flight at once. That's the minimum number of dependency chains, preferably at least 8. Anything less will leave loop-carried dependency chains as the bottleneck, not throughput.
(Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators))
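In C source terms (with -ffast-math, so the compiler is allowed to reassociate anyway), the multiple-accumulator idea looks like the sketch below; the 4-accumulator count and names are illustrative, and clang/gcc may still re-vectorize it differently:

#include <stddef.h>

// Hand-written multiple accumulators: 4 independent dependency chains, so the
// loop-carried latency of each add overlaps with the others.
// (With AVX, each accumulator should become its own ymm register.)
double sum_unrolled4(const double *a, size_t n)   // assumes n is a multiple of 4
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}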
So you're looking for additional instructions in the inner loop like vaddpd ymm1, ymm1, [rdi+32]. Golden Cove's 3c latency / 0.5c reciprocal throughput for vaddps/pd comes from dedicated SIMD-FP add ALUs, not the 4-cycle pipeline of the MUL/FMA execution units that had also handled add/sub since Skylake. Unlike Haswell, Golden Cove (Alder/Raptor Lake P-cores) has two of these ALUs, so throughput is still as good as FMA.
GCC's -funroll-loops isn't useful here: it unrolls the loop but still uses only a single accumulator vector. (Even with #pragma omp simd reduction (+:sum) and -fopenmp.) Clang will unroll with 4 accumulators by default. With -march=raptorlake it will unroll by 20, but still with only 4 accumulators, so it does 5 adds into each vector. It also uses indexed addressing modes like [rdi + 8*rcx + 32], so each vaddpd ymm, ymm, [reg+reg*8] un-laminates to 2 uops instead of keeping front-end cost as low as possible. With only one array involved, a pointer increment instead of an index wouldn't even cost anything extra; it doesn't do anything clever like indexing relative to the end of the array with a negative index counting up toward zero. But that's not a bottleneck here: Golden Cove's wide front-end (6 uops) can issue 3 of these vaddpd [mem+idx] instructions per cycle, keeping ahead of the back-end (2/clock). Even 4-wide Skylake can keep up with this limited unroll.
#pragma clang loop interleave_count(8) before the loop gets clang to unroll with more accumulators. (For counts above 8 it ignores the hint and just does 4 :/) That's probably only a good idea for code where you expect L1d hits; the default is fine if you expect your array to be coming from L2 or farther away. Of course the non-interleaved part of the unroll is then also a waste of code size, and costs even more cleanup code if n isn't a compile-time constant. Docs: https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations
By default, with no pragma, just -O3 -ffast-math -march=native (plus -mbranches-within-32B-boundaries on Skylake), we get the same unroll by 20 with 4 interleaved accumulators that clang uses for Raptor Lake. (It also fully unrolls the REPS timing/printing loop, so this big loop is repeated 5 times. That's almost certainly worse than spending 1 register and a couple of instructions to reuse code that's already hot in cache.)
# clang 16 no pragma, unrolls by 20 with 4 accumulators
inner_loop_top:
1360: c5 fd 58 84 cb a0 fd ff ff vaddpd ymm0,ymm0, [rbx+rcx*8-0x260]
1369: c5 f5 58 8c cb c0 fd ff ff vaddpd ymm1,ymm1,[rbx+rcx*8-0x240]
1372: c5 ed 58 94 cb e0 fd ff ff vaddpd ymm2,ymm2, [rbx+rcx*8-0x220]
137b: c5 e5 58 9c cb 00 fe ff ff vaddpd ymm3,ymm3, [rbx+rcx*8-0x200]
1384: c5 fd 58 84 cb 20 fe ff ff vaddpd ymm0,ymm0, [rbx+rcx*8-0x1e0]
... ymm1, ymm2
139f: c5 e5 58 9c cb 80 fe ff ff vaddpd ymm3,ymm3,[rbx+rcx*8-0x180]
... 2 more copies of ymm0..3, ending with the next insn, the first to use a 1-byte disp8
13e7: c5 e5 58 5c cb 80 vaddpd ymm3,ymm3, [rbx+rcx*8-0x80]
13ed: c5 fd 58 44 cb a0 vaddpd ymm0,ymm0, [rbx+rcx*8-0x60]
13f3: c5 f5 58 4c cb c0 vaddpd ymm1,ymm1, [rbx+rcx*8-0x40]
13f9: c5 ed 58 54 cb e0 vaddpd ymm2,ymm2, [rbx+rcx*8-0x20]
13ff: c5 e5 58 1c cb vaddpd ymm3,ymm3, [rbx+rcx*8]
1404: 48 83 c1 50 add rcx,0x50
1408: 48 81 f9 ec 0f 00 00 cmp rcx,0xfec
140f: 0f 85 4b ff ff ff jne 1360 <main+0x80>
With the pragma, when inlined into main, it unrolls by 16 with 8 accumulators. 4000 isn't a multiple of 16x4, so the loop-exit condition is in the middle of the loop, between the two groups of 8 adds.
# clang 16 with pragma, unrolls by 16 with 8 accumulators
inner_loop_top:
13f0: c5 fd 58 84 cb 20 fe ff ff vaddpd ymm0,ymm0,[rbx+rcx*8-0x1e0]
13f9: c5 f5 58 8c cb 40 fe ff ff vaddpd ymm1,ymm1,[rbx+rcx*8-0x1c0]
1402: c5 ed 58 94 cb 60 fe ff ff vaddpd ymm2,ymm2, [rbx+rcx*8-0x1a0]
140b: c5 e5 58 9c cb 80 fe ff ff vaddpd ymm3,ymm3, [rbx+rcx*8-0x180]
1414: c5 dd 58 a4 cb a0 fe ff ff vaddpd ymm4,ymm4,[rbx+rcx*8-0x160]
141d: c5 d5 58 ac cb c0 fe ff ff vaddpd ymm5,ymm5, [rbx+rcx*8-0x140]
1426: c5 cd 58 b4 cb e0 fe ff ff vaddpd ymm6,ymm6,[rbx+rcx*8-0x120]
142f: c5 c5 58 bc cb 00 ff ff ff vaddpd ymm7,ymm7, [rbx+rcx*8-0x100]
1438: 0f 1f 84 00 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0] # JCC erratum workaround
1440: 48 81 f9 bc 0f 00 00 cmp rcx,0xfbc
1447: 0f 84 33 ff ff ff je 1380 <main+0x60>
144d: c5 fd 58 84 cb 20 ff ff ff vaddpd ymm0,ymm0, [rbx+rcx*8-0xe0]
1456: c5 f5 58 8c cb 40 ff ff ff vaddpd ymm1,ymm1, [rbx+rcx*8-0xc0]
145f: c5 ed 58 94 cb 60 ff ff ff vaddpd ymm2,ymm2, [rbx+rcx*8-0xa0]
1468: c5 e5 58 5c cb 80 vaddpd ymm3,ymm3, [rbx+rcx*8-0x80]
146e: c5 dd 58 64 cb a0 vaddpd ymm4,ymm4, [rbx+rcx*8-0x60]
1474: c5 d5 58 6c cb c0 vaddpd ymm5,ymm5, [rbx+rcx*8-0x40]
147a: c5 cd 58 74 cb e0 vaddpd ymm6,ymm6, [rbx+rcx*8-0x20]
1480: c5 c5 58 3c cb vaddpd ymm7,ymm7, [rbx+rcx*8]
1485: 48 83 c1 40 add rcx,0x40
1489: e9 62 ff ff ff jmp 13f0 <main+0xd0>
I tried changing the source to encourage the compilers to increment a pointer, but clang didn't take the hint; it just invented an index counter in a register and used addressing modes like [rdi + r8*8 + 0x20]
const double * endp = array+size;
#pragma clang loop interleave_count(8)
while (array != endp) { // like a C++ range-for
sum += *array++; // no benefit, clang pessimizes back to an index
}
Updated microbenchmark source
// #define SIZE 5120 // 40 KiB, fits in Raptor Lake's 48KiB
#define SIZE 4000 // fits in SKL's 32KiB L1d cache
#define REPS 5
...
double *array = aligned_alloc(32, array_size * sizeof(double));
// double *array = malloc(array_size * sizeof(double));
...
double sum_array(const double *array, long size) {
double sum = 0.0;
//#pragma clang loop interleave_count(8) // uncomment this, optionally
for (size_t i = 0; i < size; ++i) {
sum += array[i];
}
return sum;
}
int main() {
double *xs = make_random_array(SIZE);
if (xs == NULL) return 1;
const int inner_reps = 1000000; // sum the array this many times each timed interval
for (int i = 0; i < REPS; i++) {
clock_t start_time = clock();
volatile double r; // do something with the sum even when we don't print
for (int i=0 ; i<inner_reps ; i++){ // new inner loop
r = sum_array(xs, SIZE);
// asm(""::"r"(xs) :"memory"); // forget about the array contents and redo the sum
// turned out not to be necessary, clang is still doing the work
}
clock_t end_time = clock();
double dt = (double)(end_time - start_time) / (CLOCKS_PER_SEC * inner_reps);
printf("%.2f MFLOPS (%.2f)\n", (double)SIZE / dt / 1000000, r);
}
free(xs);
return 0;
}
With a const int inner_reps = 1000000; repeat count of sums added inside each timed interval, plus something to make sure the optimizer doesn't defeat the benchmark (Godbolt; also with SIZE reduced to 4000 to fit in my 32 KiB L1d), on my Skylake at 4.2 GHz I get the expected 16x speedup.
GCC 13.2.1 and Clang 16.0.6, on Arch GNU/Linux, kernel 6.5.
# Without any vectorization
$ gcc -O3 -march=native -Wall arr-sum.c
$ taskset -c 1 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_arith_inst_retired.256b_packed_double -r1 ./a.out
1057.69 MFLOPS (2003.09)
1059.17 MFLOPS (2003.09)
1059.67 MFLOPS (2003.09)
1060.30 MFLOPS (2003.09)
1060.34 MFLOPS (2003.09)
... perf results below
# with 1 vector accumulator
$ gcc -O3 -march=native -ffast-math -Wall arr-sum.c
$ taskset -c 1 perf stat ... a.out
4389.68 MFLOPS (2003.09)
4389.32 MFLOPS (2003.09)
4381.48 MFLOPS (2003.09)
4393.57 MFLOPS (2003.09)
4389.98 MFLOPS (2003.09)
... perf results below
# unrolled by 4 vectors
$ clang -O3 -march=native -ffast-math -Wall arr-sum.c # clang unrolls by default
$ taskset -c 1 perf stat ... a.out
17048.41 MFLOPS (2003.09)
17072.49 MFLOPS (2003.09)
17060.55 MFLOPS (2003.09)
17081.02 MFLOPS (2003.09)
17099.79 MFLOPS (2003.09)
... perf results below, but including:
2,303,995,395 idq.mite_uops # 1.965 G/sec
# suffering from the JCC erratum in the inner loop; avoid it:
$ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c
$ taskset -c 1 perf stat ... a.out
17013.53 MFLOPS (2003.09)
17061.79 MFLOPS (2003.09)
17064.99 MFLOPS (2003.09)
17109.44 MFLOPS (2003.09)
17001.74 MFLOPS (2003.09)
... perf results below; summary: 1.17 seconds
4,905,130,231 cycles # 4.178 GHz
5,941,872,098 instructions # 1.21 insn per cycle
5,165,165 idq.mite_uops # 4.399 M/sec
5,015,000,000 fp_arith_inst_retired.256b_packed_double # 4.271 G/sec
# With #pragma clang loop interleave_count(8) in the source
# for unrolling by 8 instead of 4
$ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c
$ taskset -c 1 perf stat ... a.out
28505.05 MFLOPS (2003.09)
28553.48 MFLOPS (2003.09)
28556.13 MFLOPS (2003.09)
28597.37 MFLOPS (2003.09)
28548.18 MFLOPS (2003.09)
# imperfect scheduling and a front-end bottleneck from clang's bad choice of addressing-mode
# means we don't get another 2x over the default.
(With perf stat -d I also confirmed the L1d cache miss rate was under 1%. With a larger array size like 20000, which fits in Skylake's 256K L2 cache but not L1d, I still got throughput fairly close to 1 vector per clock.)
The JCC erratum workaround (Skylake-family only, not your CPU) gives a negligible further speedup in this case; the front-end wasn't the bottleneck even with legacy decode: apparently the un-lamination happens at issue, so the decoders don't get stuck on the 2-uop instructions. Average uops_issued.any throughput is still only 2.18/clock (with the unroll by 4).
So we get a factor-of-16 speedup on Skylake from vectorizing with AVX (4x) and instruction-level parallelism from 4 accumulators. That's still only a bit better than 1 vaddpd per clock cycle on average (thanks to ILP across repeat-loop iterations), but clang's 4 dep chains are only half of Skylake's 4-cycle latency x 2 insn/cycle throughput = 8 FP math instructions in flight.
Unrolling with only 4 accumulators leaves up to another factor of 2 of performance on the table (for Skylake; less for Alder Lake and later. Update: we got most of it with the pragma). But that's only achievable with data hot in L1d cache from careful cache-blocking, or if you do more work with the data while it's in registers (higher computational intensity, not just 1 add per load). Getting the full other 2x would also take an optimizer that knows about Sandybridge-family un-lamination, which clang's apparently doesn't. Clang's default of 4 accumulators seems reasonable, and more accumulators would mean more init and cleanup work, although unrolling by 20 with only 4 accumulators seems excessive, like a waste of I-cache / uop-cache footprint.
Performance counter results
User-space-only counts, on i7-6700k Skylake (EPP=performance), with Linux kernel & perf 6.5. This is for the whole process including startup, but the inner repeat count of 1 million means the vast majority of total time is spent in the loop we care about, not in startup.
Scalar loop:
Performance counter stats for './a.out' (GCC O3-native without fast-math):
18,902.70 msec task-clock # 1.000 CPUs utilized
54 context-switches # 2.857 /sec
0 cpu-migrations # 0.000 /sec
72 page-faults # 3.809 /sec
79,099,401,032 cycles # 4.185 GHz
35,069,666,963 instructions # 0.44 insn per cycle
30,109,096,046 uops_issued.any # 1.593 G/sec
50,096,899,159 uops_executed.thread # 2.650 G/sec
46,353,551 idq.mite_uops # 2.452 M/sec
0 fp_arith_inst_retired.256b_packed_double # 0.000 /sec
18.902876984 seconds time elapsed
18.893778000 seconds user
0.000000000 seconds sys
Note the 0 count for fp_arith_inst_retired.256b_packed_double: no 256-bit SIMD instructions at all.
Vectorized but not unrolled:
Performance counter stats for './a.out' (GCC O3-native-fast-math):
4,559.54 msec task-clock # 1.000 CPUs utilized
8 context-switches # 1.755 /sec
0 cpu-migrations # 0.000 /sec
74 page-faults # 16.230 /sec
19,093,881,407 cycles # 4.188 GHz
20,060,557,627 instructions # 1.05 insn per cycle
15,094,070,341 uops_issued.any # 3.310 G/sec
20,075,885,996 uops_executed.thread # 4.403 G/sec
12,015,692 idq.mite_uops # 2.635 M/sec
5,000,000,000 fp_arith_inst_retired.256b_packed_double # 1.097 G/sec
4.559770793 seconds time elapsed
4.557838000 seconds user
0.000000000 seconds sys
Vectorized, unrolled by 20 with 4 accumulators:
Performance counter stats for './a.out': (Clang -O3-native-fast-math JCC-workaround)
1,174.07 msec task-clock # 1.000 CPUs utilized
5 context-switches # 4.259 /sec
0 cpu-migrations # 0.000 /sec
72 page-faults # 61.325 /sec
4,905,130,231 cycles # 4.178 GHz
5,941,872,098 instructions # 1.21 insn per cycle
10,689,939,125 uops_issued.any # 9.105 G/sec
10,566,645,887 uops_executed.thread # 9.000 G/sec
5,165,165 idq.mite_uops # 4.399 M/sec
5,015,000,000 fp_arith_inst_retired.256b_packed_double # 4.271 G/sec
1.174507232 seconds time elapsed
1.173769000 seconds user
0.000000000 seconds sys
Note the count of slightly more than 5,000 million 256-bit vector instructions: that's the 3x vaddpd reducing the 4 accumulators down to 1 before the horizontal sum down to a single scalar. (The hsum starts with a vextractf128 of the high half and then uses 128-bit vector instructions, so this counter doesn't count them, but they still compete with the work of the next iteration that's starting.)
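For reference, a typical AVX horizontal sum of one __m256d in intrinsics (a sketch of the usual extract-high-half pattern, not clang's exact generated sequence):

#include <immintrin.h>

// Reduce one __m256d to a scalar. After the vextractf128, all FP adds here are
// 128-bit, so they don't show up in the 256b_packed_double counter.
static inline double hsum256_pd(__m256d v)
{
    __m128d lo = _mm256_castpd256_pd128(v);      // low 128 bits
    __m128d hi = _mm256_extractf128_pd(v, 1);    // high 128 bits (vextractf128)
    lo = _mm_add_pd(lo, hi);                     // [v0+v2, v1+v3]
    __m128d shuf = _mm_unpackhi_pd(lo, lo);      // broadcast the high element
    return _mm_cvtsd_f64(_mm_add_sd(lo, shuf));  // (v0+v2) + (v1+v3)
}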
Vectorized, unrolled by 16 with 8 accumulators:
Performance counter stats for './a.out' (clang -O3 native fast-math #pragma ... interleave_count(8)
):
701.30 msec task-clock # 0.999 CPUs utilized
3 context-switches # 4.278 /sec
0 cpu-migrations # 0.000 /sec
71 page-faults # 101.241 /sec
2,931,696,392 cycles # 4.180 GHz
6,566,898,298 instructions # 2.24 insn per cycle
11,249,046,508 uops_issued.any # 16.040 G/sec
11,019,891,003 uops_executed.thread # 15.714 G/sec
3,153,961 idq.mite_uops # 4.497 M/sec
5,035,000,000 fp_arith_inst_retired.256b_packed_double # 7.180 G/sec
0.701728321 seconds time elapsed
0.701217000 seconds user
0.000000000 seconds sys
More cleanup work after the loop: 7x vaddpd to get down to 1 vector. Instead of another 2x speedup, we bottleneck at 16.040 G uops issued / 4.180 GHz ≈ 3.87 average uops issued per clock, with most cycles issuing Skylake's max of 4. That's due to clang/LLVM not knowing how to tune for Intel CPUs with its use of indexed addressing modes. (uops executed is actually slightly lower than uops issued, so very few loads stayed micro-fused with their ALU uop, and the 8x vxorps zeroing of the 8 vectors ahead of each iteration takes an issue slot but no back-end execution unit.)
7.180 / 4.18 GHz = an average of 1.71 256-bit FP instructions executed per clock cycle.
(The CPU probably ran at 4.20 GHz the whole time, but this frequency comes from dividing the user-space-only cycle count by task-clock. Time spent in the kernel on page faults and interrupts isn't counted, since we used perf stat --all-user.)
Fixing the front-end bottleneck by avoiding indexed addressing modes improves that from 1.71 to 1.81 vaddpd per clock. (Not 2.0, because imperfect uop scheduling loses the occasional cycle and there's no slack to catch up.) That's about 30281.47 MFLOP/s on a single 4.2 GHz core.
As a starting point I used clang -O3 -fno-unroll-loops -S -march=native -ffast-math -Wall arr-sum.c -masm=intel -o arr-sum-asm.S on the C version with the interleave pragma, so that loop still unrolls with 8 accumulators, just by 8 instead of 16. The outer REPS loop stays rolled up, so I only had to hand-edit one copy of the asm loop (inlined into main). The ds prefixes on a few instructions are there to work around the JCC erratum. Note that none of these instructions need a disp32 addressing mode, because I placed the pointer increment where the displacements benefit from the full -0x80 .. +0x7f disp8 range (actually -0x80 to +0x60 here). So the machine code is considerably smaller than clang's, with instruction lengths of 5 (or 4 for the [rdi + 32*0] one). The add does end up needing an imm32, but there's only one of it. Critically, the loads stay micro-fused, cutting front-end uop bandwidth nearly in half.
vxorpd xmm0, xmm0, xmm0
...
vxorpd xmm7, xmm7, xmm7 # compiler-generated sumvec = 0
mov ecx, 4000 / (4*8) # loop trip-count
mov rdi, rbx # startp = arr
.p2align 4, 0x90
.LBB2_7: # Parent Loop BB2_5 Depth=1
# Parent Loop BB2_6 Depth=2
# => This Inner Loop Header: Depth=3
ds vaddpd ymm0, ymm0, [rdi + 32*0]
vaddpd ymm1, ymm1, [rdi + 32*1]
vaddpd ymm2, ymm2, [rdi + 32*2]
ds vaddpd ymm5, ymm5, [rdi + 32*3]
add rdi, 256
vaddpd ymm3, ymm3, [rdi - 32*4]
ds vaddpd ymm6, ymm6, [rdi - 32*3]
vaddpd ymm7, ymm7, [rdi - 32*2]
vaddpd ymm4, ymm4, [rdi - 32*1]
dec rcx # not spanning a 32B boundary
jne .LBB2_7
# %bb.8: # in Loop: Header=BB2_6 Depth=2
vaddpd ymm0, ymm1, ymm0
vaddpd ymm1, ymm5, ymm2
... hsum
$ taskset -c 1 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_arith_inst_retired.256b_packed_double -r1 ./a.out
30281.47 MFLOPS (2003.09)
30057.33 MFLOPS (2003.09)
30138.64 MFLOPS (2003.09)
30160.00 MFLOPS (2003.09)
29979.61 MFLOPS (2003.09)
Performance counter stats for './a.out':
664.79 msec task-clock # 0.999 CPUs utilized
3 context-switches # 4.513 /sec
0 cpu-migrations # 0.000 /sec
73 page-faults # 109.809 /sec
2,775,830,392 cycles # 4.176 GHz
7,007,878,485 instructions # 2.52 insn per cycle
6,457,154,731 uops_issued.any # 9.713 G/sec
11,378,180,211 uops_executed.thread # 17.115 G/sec
3,634,644 idq.mite_uops # 5.467 M/sec
5,035,000,000 fp_arith_inst_retired.256b_packed_double # 7.574 G/sec
0.665220579 seconds time elapsed
0.664698000 seconds user
0.000000000 seconds sys
uops_issued.any is now about 2.32/clock, with plenty of headroom for uop-cache fetch and other front-end work.
Unrolling by 10 instead of 8 gave only a tiny speedup, e.g. 662.49 ms total time for a good run, with a best of 30420.80 MFLOPS. IPC about 2.41.
L1d cache bandwidth is the final bottleneck on SKL before saturating the FP pipes: it can't quite sustain 2x 32-byte loads per clock. Changing 3 of the instructions to add a register to itself (sum7 += sum7) speeds the 10-accumulator version up to 619.11 ms total time, a best of 32424.90 MFLOPS, 2.58 IPC, and an average of 1.957 256-bit uops/clock. (The FP ports also have to compete with a few 128-bit adds during cleanup.)
Raptor Lake can do 3 loads per clock even for vectors, so that shouldn't be a problem there.