I'm trying to write a C program that sums an array of doubles as fast as numpy.sum, but I seem to be failing.

Here is how I measure numpy's performance:

import numpy as np
import time

SIZE=4000000
REPS=5

xs = np.random.rand(SIZE)
print(xs.dtype)

for _ in range(REPS):
    start = time.perf_counter()
    r = np.sum(xs)
    end = time.perf_counter()
    print(f"{SIZE / (end-start) / 10**6:.2f} MFLOPS ({r:.2f})")

The output is:

float64
2941.61 MFLOPS (2000279.78)
3083.56 MFLOPS (2000279.78)
3406.18 MFLOPS (2000279.78)
3712.33 MFLOPS (2000279.78)
3661.15 MFLOPS (2000279.78)

Now trying to do something similar in C:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 4000000
#define REPS 5

double *make_random_array(long array_size) {
  double *array = malloc(array_size * sizeof(double));
  if (array == NULL)
    return NULL;
  srand(0);
  for (size_t i = 0; i < array_size; ++i) {
    array[i] = (double)rand() / RAND_MAX;
  }
  return array;
}

double sum_array(const double *array, long size) {
  double sum = 0.0;
  for (size_t i = 0; i < size; ++i) {
    sum += array[i];
  }
  return sum;
}

int main() {
  double *xs = make_random_array(SIZE);
  if (xs == NULL) return 1;

  for (int i = 0; i < REPS; i++) {
    clock_t start_time = clock();
    double r = sum_array(xs, SIZE);
    clock_t end_time = clock();
    double dt = (double)(end_time - start_time) / CLOCKS_PER_SEC;
    printf("%.2f MFLOPS (%.2f)\n", (double)SIZE / dt / 1000000, r);
  }

  free(xs);
  return 0;
}

Compiling with gcc -o main -Wall -O3 -mavx main.c and running it, the output is:

1850.14 MFLOPS (1999882.86)
1857.01 MFLOPS (1999882.86)
1900.24 MFLOPS (1999882.86)
1903.86 MFLOPS (1999882.86)
1906.58 MFLOPS (1999882.86)

This is clearly much slower than numpy.

According to top, the CPU usage of the python process is around 100%, so it doesn't look like NumPy is parallelizing anything.

The C code seems to use 256-bit AVX registers (when compiling with -S there are vaddsd instructions on xmm0). That seems like the best available choice, since the machine I'm using doesn't appear to support AVX-512:

$ egrep 'model name|flags' /proc/cpuinfo  | head -n2
model name      : 13th Gen Intel(R) Core(TM) i9-13900K
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities

What trick is NumPy using to beat this C code?

Recommended answer

Your loop doesn't auto-vectorize at all, because strict FP math is the default. XMM0 is a 128-bit register; YMM0 is the corresponding 256-bit register. vaddsd is ADD Scalar Double, using only the low element of XMM0. https://felixcloutier.com/x86/addsd

Use clang -O3 -ffast-math -march=native to have it vectorize and unroll (by 4) for a 16x speedup: 4x each from AVX and from instruction-level parallelism (wikipedia / Modern Microprocessors A 90-Minute Guide!), with the array small enough not to bottleneck on L3 cache bandwidth. (Another roughly 2x is available for arrays that fit in L1d, not just L2, e.g. with #pragma clang loop interleave_count(8) to unroll even more, since code you've cache-blocked will usually be getting hits in L1d cache.)

Your Raptor Lake CPU has two fully-pipelined vector-FP add units, with a pipeline length of 3 (cycles of latency before a result is ready as input to another add). This answer includes my results on an i7-6700k Skylake, which is the same except that its FP-add pipeline has 4-cycle latency.

@Jérôme Richard comments that NumPy just does scalar pairwise summation for the sum of an FP array, which gains some ILP over a purely naive serial loop. That's fine if you're going to bottleneck on DRAM bandwidth anyway. One benefit is numerical consistency across ISAs and available SIMD features, achieved by not using them.
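For illustration, a minimal sketch of that pairwise-summation idea (just the recursive-halving shape, not NumPy's actual implementation, which uses a larger unrolled base case):

// Sketch of scalar pairwise summation. Splitting the array recursively keeps
// rounding-error growth around O(log n) and gives some ILP, since the two
// halves are independent dependency chains.
static double pairwise_sum(const double *a, long n) {
  if (n <= 8) {                       // small base case: plain serial sum
    double s = 0.0;
    for (long i = 0; i < n; ++i)
      s += a[i];
    return s;
  }
  long half = n / 2;
  return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}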


You're looking for vaddpd ymm0, ymm0, [rdi] (packed double on a 256-bit vector). GCC will do that with -ffast-math, which lets it pretend FP math is associative, changing the rounding error. (For the better in this case, e.g. if you compare against a long double sum or a Kahan error-compensated sum; it's a step in the same direction as pairwise summation.) See also https://gcc.gnu.org/wiki/FloatingPointMath

gcc -O3 -march=native -ffast-math  foo.c

That gives about a 4x speedup, since FP-ALU latency (1 vector instead of 1 scalar per 3 cycles on your CPU) is still a worse bottleneck than L3 bandwidth, and definitely worse than L2 cache bandwidth.


SIZE=4000000 times sizeof(double) is 30.52 MiB, so it will fit in the 36 MiB L3 cache of your high-end Raptor Lake. But to go faster you'll want to shrink SIZE and increase REPS (and maybe put a repeat loop inside each timed region). The whole program is so short that running it under perf stat on my i7-6700k Skylake with DDR4-2666 takes well under a second, much of which is startup. And each timed region is fairly short for timing with clock() instead of clock_gettime.

Your per-core cache sizes are 48 KiB L1d and 2 MiB L2 (on the Golden Cove P-cores; less/more on a single Gracemont E-core). https://en.wikipedia.org/wiki/Raptor_Lake / https://chipsandcheese.com/2022/08/23/a-preview-of-raptor-lakes-improved-l2-caches/. SIZE=6144 would make the array the full size of L1d; SIZE=5120 if we aim for only 40 KiB for the array, leaving room for other stuff. It's also best to align it by 32 bytes with aligned_alloc, so we can read it from L1d cache at 3 vectors (96 bytes) per clock cycle instead of getting a cache-line split every other vector. (https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/ / https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-throughput-limit / https://uops.info/)
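A minimal allocation sketch under those assumptions (sizes picked so the byte count is a multiple of the 32-byte alignment, which C11's aligned_alloc requires); the updated source further down does the same thing:

#include <stdlib.h>

#define SIZE 5120   // 40 KiB of doubles: fits in Raptor Lake's 48 KiB L1d with room to spare

double *make_aligned_array(long n) {
  // 32-byte alignment so every 256-bit load comes from a single cache line.
  // n * sizeof(double) must be a multiple of 32 here (n a multiple of 4).
  double *a = aligned_alloc(32, n * sizeof(double));
  return a;   // caller checks for NULL and frees
}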

To get anywhere near peak FLOPS (within a factor of 2 since we're not using FMA), we need to run 2 vaddpd instructions every clock cycle. But it has a latency of 3 cycles on your Golden Cove P-cores (Alder/Raptor Lake), so the latency * bandwidth product is 6 vaddpd in flight at once. That's the minimum number of dependency chains, preferably at least 8. Anything less will leave loop-carried dependency chains as the bottleneck, not throughput. (Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators))
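At the source level, that means multiple independent accumulators so several add chains are in flight at once. A scalar sketch of the idea (8 accumulators per the analysis above, which a compiler can then widen to YMM vectors with -ffast-math -march=native; size is assumed to be a multiple of 8 just to keep the sketch short):

// Sketch: 8 independent accumulators = 8 add dependency chains in flight.
// A real version needs a cleanup loop for the size % 8 leftover elements.
double sum_array_unrolled(const double *array, long size) {
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
  for (long i = 0; i < size; i += 8) {
    s0 += array[i + 0];  s1 += array[i + 1];
    s2 += array[i + 2];  s3 += array[i + 3];
    s4 += array[i + 4];  s5 += array[i + 5];
    s6 += array[i + 6];  s7 += array[i + 7];
  }
  return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}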

So you're looking for extra instructions in the inner loop, like vaddpd ymm1, ymm1, [rdi+32]. Golden Cove's 3c latency / 0.5c reciprocal-throughput vaddps/pd is thanks to a dedicated SIMD-FP add ALU, rather than the 4-cycle pipeline of the MUL/FMA execution units that had also handled add/sub since Skylake. Unlike Haswell, Golden Cove (Alder/Raptor Lake P-cores) has two such ALUs, so add throughput is still as good as with FMAs.

GCC's -funroll-loops is no use here: it unrolls the loop but still with only one accumulator vector. (Even with #pragma omp simd reduction (+:sum) and -fopenmp.) Clang will unroll with 4 accumulators by default. With -march=raptorlake it will unroll by 20, but still with only 4 accumulators, so 5 adds go into each vector. It also uses indexed addressing modes like [rdi + 8*rcx + 32], so each vaddpd ymm, ymm, [reg+reg*8] un-laminates into 2 uops instead of keeping front-end cost as low as possible. There's only one array involved, so a pointer increment instead of an index wouldn't even cost anything extra; it doesn't do anything clever like indexing relative to the end of the array with a negative index counting up toward zero. But that isn't the bottleneck here: Golden Cove's wide front-end (6 uops) can issue 3 of these vaddpd [mem+idx] instructions per cycle, staying ahead of the back-end (2/clock). Even 4-wide Skylake can keep up with this limited unroll.

#pragma clang loop interleave_count(8) before the loop gets clang to unroll with more accumulators. (For more than 8 it ignores the pragma and just does 4 :/) That's probably only a good idea for code you expect to get L1d hits; the default is fine if you expect your array to be coming from L2 or farther away. Of course, the non-interleaved part of the unroll is also a waste of code size in that case, and costs extra cleanup code if n isn't a compile-time constant. Docs: https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

By default, with no pragma but with -O3 -ffast-math -march=native (plus -mbranches-within-32B-boundaries on Skylake), we get the same unroll by 20 with 4 interleaved accumulators that clang uses for Raptor Lake. (It also fully unrolls the REPS timing/printing loop, so this big loop is repeated 5 times. That's almost certainly worse than spending 1 register and a couple of instructions to re-run code that's already hot in cache.)

# clang 16  no pragma, unrolls by 20 with 4 accumulators
inner_loop_top:
    1360:       c5 fd 58 84 cb a0 fd ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0x260]
    1369:       c5 f5 58 8c cb c0 fd ff ff      vaddpd ymm1,ymm1,[rbx+rcx*8-0x240]
    1372:       c5 ed 58 94 cb e0 fd ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0x220]
    137b:       c5 e5 58 9c cb 00 fe ff ff      vaddpd ymm3,ymm3, [rbx+rcx*8-0x200]
    1384:       c5 fd 58 84 cb 20 fe ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0x1e0]
  ... ymm1, ymm2
    139f:       c5 e5 58 9c cb 80 fe ff ff      vaddpd ymm3,ymm3,[rbx+rcx*8-0x180]

... 2 more copies of ymm0..3, ending with the next insn, the first to use a 1-byte disp8
    13e7:       c5 e5 58 5c cb 80       vaddpd ymm3,ymm3, [rbx+rcx*8-0x80]

    13ed:       c5 fd 58 44 cb a0       vaddpd ymm0,ymm0, [rbx+rcx*8-0x60]
    13f3:       c5 f5 58 4c cb c0       vaddpd ymm1,ymm1, [rbx+rcx*8-0x40]
    13f9:       c5 ed 58 54 cb e0       vaddpd ymm2,ymm2, [rbx+rcx*8-0x20]
    13ff:       c5 e5 58 1c cb          vaddpd ymm3,ymm3, [rbx+rcx*8]
    1404:       48 83 c1 50             add    rcx,0x50
    1408:       48 81 f9 ec 0f 00 00    cmp    rcx,0xfec
    140f:       0f 85 4b ff ff ff       jne    1360 <main+0x80>

Compared with that, with the pragma, when inlined into main it unrolls by 16 with 8 accumulators. 4000 is not a multiple of 16x4, so the loop-exit condition sits in the middle of the loop, between two groups of 8 adds.

# clang 16  with pragma, unrolls by 16 with 8 accumulators
inner_loop_top:
    13f0:       c5 fd 58 84 cb 20 fe ff ff      vaddpd ymm0,ymm0,[rbx+rcx*8-0x1e0]
    13f9:       c5 f5 58 8c cb 40 fe ff ff      vaddpd ymm1,ymm1,[rbx+rcx*8-0x1c0]
    1402:       c5 ed 58 94 cb 60 fe ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0x1a0]
    140b:       c5 e5 58 9c cb 80 fe ff ff      vaddpd ymm3,ymm3, [rbx+rcx*8-0x180]
    1414:       c5 dd 58 a4 cb a0 fe ff ff      vaddpd ymm4,ymm4,[rbx+rcx*8-0x160]
    141d:       c5 d5 58 ac cb c0 fe ff ff      vaddpd ymm5,ymm5, [rbx+rcx*8-0x140]
    1426:       c5 cd 58 b4 cb e0 fe ff ff      vaddpd ymm6,ymm6,[rbx+rcx*8-0x120]
    142f:       c5 c5 58 bc cb 00 ff ff ff      vaddpd ymm7,ymm7, [rbx+rcx*8-0x100]
    1438:       0f 1f 84 00 00 00 00 00         nop    DWORD PTR [rax+rax*1+0x0]       # JCC erratum workaround
    1440:       48 81 f9 bc 0f 00 00    cmp    rcx,0xfbc
    1447:       0f 84 33 ff ff ff       je     1380 <main+0x60>
    144d:       c5 fd 58 84 cb 20 ff ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0xe0]
    1456:       c5 f5 58 8c cb 40 ff ff ff      vaddpd ymm1,ymm1, [rbx+rcx*8-0xc0]
    145f:       c5 ed 58 94 cb 60 ff ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0xa0]
    1468:       c5 e5 58 5c cb 80       vaddpd ymm3,ymm3, [rbx+rcx*8-0x80]
    146e:       c5 dd 58 64 cb a0       vaddpd ymm4,ymm4, [rbx+rcx*8-0x60]
    1474:       c5 d5 58 6c cb c0       vaddpd ymm5,ymm5, [rbx+rcx*8-0x40]
    147a:       c5 cd 58 74 cb e0       vaddpd ymm6,ymm6, [rbx+rcx*8-0x20]
    1480:       c5 c5 58 3c cb          vaddpd ymm7,ymm7, [rbx+rcx*8]
    1485:       48 83 c1 40             add    rcx,0x40
    1489:       e9 62 ff ff ff          jmp    13f0 <main+0xd0>

I tried changing the source to encourage the compiler to increment a pointer, but clang didn't take the hint, instead inventing an index counter in a register and using addressing modes like [rdi + r8*8 + 0x20]:

  const double * endp = array+size;
#pragma clang loop interleave_count(8)
  while (array != endp) {  // like a C++ range-for
    sum += *array++;       // no benefit, clang pessimizes back to an index
  }

Updated microbenchmark source

// #define SIZE 5120 // 40 KiB, fits in Raptor Lake's 48KiB
#define SIZE 4000     // fits in SKL's 32KiB L1d cache
#define REPS 5

...

        double *array = aligned_alloc(32, array_size * sizeof(double));
//  double *array = malloc(array_size * sizeof(double));

...

double sum_array(const double *array, long size) {
  double sum = 0.0;
//#pragma clang loop interleave_count(8)   // uncomment this, optionally
  for (size_t i = 0; i < size; ++i) {
    sum += array[i];
  }
  return sum;
}


int main() {
  double *xs = make_random_array(SIZE);
  if (xs == NULL) return 1;

  const int  inner_reps = 1000000;  // sum the array this many times each timed interval
  for (int i = 0; i < REPS; i++) {
    clock_t start_time = clock();
    volatile double r;  // do something with the sum even when we don't print
    for (int i=0 ; i<inner_reps ; i++){  // new inner loop
       r = sum_array(xs, SIZE);
       //  asm(""::"r"(xs) :"memory");  // forget about the array contents and redo the sum
       // turned out not to be necessary, clang is still doing the work
    }
    clock_t end_time = clock();
    double dt = (double)(end_time - start_time) / (CLOCKS_PER_SEC * inner_reps);
    printf("%.2f MFLOPS (%.2f)\n", (double)SIZE / dt / 1000000, r);
  }

  free(xs);
  return 0;
}

With const int inner_reps = 1000000; added as a repeat count of array sums per timed interval, plus some steps to make sure the optimizer doesn't defeat the benchmark (Godbolt - also shrinking SIZE to 4000 to fit in my 32 KiB L1d), on my Skylake at 4.2 GHz I get the expected 16x speedup.

GCC 13.2.1 and clang 16.0.6 on Arch GNU/Linux, kernel 6.5.

# Without any vectorization
$ gcc -O3 -march=native -Wall arr-sum.c
$ taskset -c 1 perf stat  -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_arith_inst_retired.256b_packed_single   -r1 ./a.out
1057.69 MFLOPS (2003.09)
1059.17 MFLOPS (2003.09)
1059.67 MFLOPS (2003.09)
1060.30 MFLOPS (2003.09)
1060.34 MFLOPS (2003.09)
... perf results below

# with 1 vector accumulator
$ gcc -O3 -march=native -ffast-math -Wall arr-sum.c
$ taskset -c 1 perf stat ... a.out
4389.68 MFLOPS (2003.09)
4389.32 MFLOPS (2003.09)
4381.48 MFLOPS (2003.09)
4393.57 MFLOPS (2003.09)
4389.98 MFLOPS (2003.09)
... perf results below

# unrolled by 4 vectors
$ clang -O3 -march=native -ffast-math -Wall arr-sum.c   # clang unrolls by default
$ taskset -c 1 perf stat ... a.out
17048.41 MFLOPS (2003.09)
17072.49 MFLOPS (2003.09)
17060.55 MFLOPS (2003.09)
17081.02 MFLOPS (2003.09)
17099.79 MFLOPS (2003.09)
... perf results below, but including:
     2,303,995,395      idq.mite_uops                    #    1.965 G/sec                     
  # suffering from the JCC erratum in the inner loop; avoid it:

$ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c
$ taskset -c 1 perf stat ... a.out
17013.53 MFLOPS (2003.09)
17061.79 MFLOPS (2003.09)
17064.99 MFLOPS (2003.09)
17109.44 MFLOPS (2003.09)
17001.74 MFLOPS (2003.09)
... perf results below; summary: 1.17 seconds
     4,905,130,231      cycles                           #    4.178 GHz                       
     5,941,872,098      instructions                     #    1.21  insn per cycle
         5,165,165      idq.mite_uops                    #    4.399 M/sec
     5,015,000,000      fp_arith_inst_retired.256b_packed_double #    4.271 G/sec

 # With  #pragma clang loop interleave_count(8) in the source
 # for unrolling by 8 instead of 4
$ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c
$ taskset -c 1 perf stat ... a.out
28505.05 MFLOPS (2003.09)
28553.48 MFLOPS (2003.09)
28556.13 MFLOPS (2003.09)
28597.37 MFLOPS (2003.09)
28548.18 MFLOPS (2003.09)
 # imperfect scheduling and a front-end bottleneck from clang's bad choice of addressing-mode
 # means we don't get another 2x over the default.

(With perf stat -d I also confirmed the L1d cache miss rate was under 1%. With a larger array size like 20000, which fits in Skylake's 256K L2 cache but not in L1d, I still got throughput fairly close to 1 vector per clock.)

The JCC erratum workaround (Skylake-family only, not your CPU) gives only a trivial further speedup in this case; the front-end wasn't the bottleneck even with legacy decode, since un-lamination happens after decode and the decoders don't get stuck handling 2-uop instructions. And uops_issued.any still averaged only 2.18/clock (with the unroll by 4).

So we get a factor of 16 speedup on Skylake from vectorizing with AVX (4x) and instruction-level parallelism of 4 accumulators. That's still only slightly better than 1 vaddpd per clock cycle on average (thanks to ILP across repeat-loop iterations), but clang's 4 dep chains are only half of Skylake's 4-cycle latency x 2 insn/cycle throughput = 8 FP math instructions in flight.

Unrolling by only 4 leaves another factor of 2 of performance on the table (for Skylake; less for Alder Lake and later. Update: we got most of that with the pragma). But that's only achievable with data hot in L1d cache, with careful cache-blocking, or if you're doing more work with the data while it's in registers (higher computational intensity, not just 1 add per load). Getting another full 2x would also require an optimizer that knows about Sandybridge-family un-lamination, which clang's apparently doesn't. Clang's default of 4 accumulators seems reasonable; more accumulators would mean more init and cleanup work, though unrolling by 20 with only 4 accumulators seems like too much, a waste of I-cache / uop-cache footprint.
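As a sketch of what that would look like written by hand: AVX intrinsics with 8 vector accumulators and a plain pointer increment, so each vaddpd can in principle use a simple [reg+disp8] memory operand and stay micro-fused. (My own naming, not code from the answer; it assumes a 32-byte-aligned array whose length is a multiple of 32 doubles. hsum256d is sketched further down, next to the cleanup discussion.)

#include <immintrin.h>

double hsum256d(__m256d v);   // horizontal-sum helper, sketched below

// Sketch: 8 YMM accumulators, non-indexed addressing via a pointer increment.
// Assumes `array` is 32-byte aligned and `size` is a multiple of 32.
double sum_array_avx(const double *array, long size) {
  __m256d acc[8];
  for (int k = 0; k < 8; ++k)
    acc[k] = _mm256_setzero_pd();

  for (const double *end = array + size; array != end; array += 32) {
    for (int k = 0; k < 8; ++k)                 // fully unrolled by the compiler
      acc[k] = _mm256_add_pd(acc[k], _mm256_load_pd(array + 4 * k));
  }
  // Pairwise-reduce the 8 accumulators down to 1 vector, then to scalar.
  __m256d v0123 = _mm256_add_pd(_mm256_add_pd(acc[0], acc[1]),
                                _mm256_add_pd(acc[2], acc[3]));
  __m256d v4567 = _mm256_add_pd(_mm256_add_pd(acc[4], acc[5]),
                                _mm256_add_pd(acc[6], acc[7]));
  return hsum256d(_mm256_add_pd(v0123, v4567));
}

Build with -O3 -mavx (or -march=native). Note that, as seen above with the scalar pointer-increment attempt, clang may still re-introduce an index register, so checking the generated asm is worthwhile.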


Performance counter results

User-space-only counts, on i7-6700k Skylake (EPP=performance) with Linux kernel & perf 6.5. This is for the whole process including startup, but the inner repeat count of 1 million means the vast majority of the total time is spent in the loop we care about, not in startup.

Scalar loop:
Performance counter stats for './a.out' (GCC O3-native without fast-math):

     18,902.70 msec task-clock                       #    1.000 CPUs utilized
            54      context-switches                 #    2.857 /sec
             0      cpu-migrations                   #    0.000 /sec
            72      page-faults                      #    3.809 /sec
79,099,401,032      cycles                           #    4.185 GHz
35,069,666,963      instructions                     #    0.44  insn per cycle
30,109,096,046      uops_issued.any                  #    1.593 G/sec
50,096,899,159      uops_executed.thread             #    2.650 G/sec
    46,353,551      idq.mite_uops                    #    2.452 M/sec
             0      fp_arith_inst_retired.256b_packed_double #    0.000 /sec

  18.902876984 seconds time elapsed

  18.893778000 seconds user
   0.000000000 seconds sys

Note the count of 0 for fp_arith_inst_retired.256b_packed_double - no 256-bit SIMD instructions at all.

Vectorized but not unrolled:
Performance counter stats for './a.out' (GCC O3-native-fast-math):

      4,559.54 msec task-clock                       #    1.000 CPUs utilized
             8      context-switches                 #    1.755 /sec
             0      cpu-migrations                   #    0.000 /sec
            74      page-faults                      #   16.230 /sec
19,093,881,407      cycles                           #    4.188 GHz
20,060,557,627      instructions                     #    1.05  insn per cycle
15,094,070,341      uops_issued.any                  #    3.310 G/sec
20,075,885,996      uops_executed.thread             #    4.403 G/sec
    12,015,692      idq.mite_uops                    #    2.635 M/sec
 5,000,000,000      fp_arith_inst_retired.256b_packed_double #    1.097 G/sec

   4.559770793 seconds time elapsed

   4.557838000 seconds user
   0.000000000 seconds sys

Vectorized, unrolled by 20 with 4 accumulators:
Performance counter stats for './a.out': (Clang -O3-native-fast-math JCC-workaround)

      1,174.07 msec task-clock                       #    1.000 CPUs utilized
             5      context-switches                 #    4.259 /sec
             0      cpu-migrations                   #    0.000 /sec
            72      page-faults                      #   61.325 /sec
 4,905,130,231      cycles                           #    4.178 GHz
 5,941,872,098      instructions                     #    1.21  insn per cycle
10,689,939,125      uops_issued.any                  #    9.105 G/sec
10,566,645,887      uops_executed.thread             #    9.000 G/sec
     5,165,165      idq.mite_uops                    #    4.399 M/sec
 5,015,000,000      fp_arith_inst_retired.256b_packed_double #    4.271 G/sec

   1.174507232 seconds time elapsed

   1.173769000 seconds user
   0.000000000 seconds sys

Note the slightly-more-than-5-billion count of 256-bit vector instructions: that's the reduction of 4 accumulators down to 1 (3x vaddpd) before the horizontal sum down to 1 scalar. (The hsum starts with vextractf128 of the high half and then uses 128-bit vector instructions, so this counter doesn't include them, but they still compete with the work of the next iteration that's starting.)
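For reference, the usual shape of that horizontal sum written as intrinsics (this is the helper assumed in the sketch above; the name is mine):

#include <immintrin.h>

// Extract the high 128 bits (vextractf128), add to the low half, then add the
// two remaining doubles. The 128-bit ops don't count as 256b_packed_double.
double hsum256d(__m256d v) {
  __m128d lo   = _mm256_castpd256_pd128(v);     // low half, no instruction needed
  __m128d hi   = _mm256_extractf128_pd(v, 1);   // vextractf128 of the high half
  __m128d sum2 = _mm_add_pd(lo, hi);            // 2 doubles left
  __m128d swap = _mm_unpackhi_pd(sum2, sum2);   // bring the upper double down
  return _mm_cvtsd_f64(_mm_add_sd(sum2, swap)); // final scalar add
}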

Vectorized, unrolled by 16 with 8 accumulators:
Performance counter stats for './a.out' (clang -O3 native fast-math #pragma ... interleave_count(8)):

        701.30 msec task-clock                       #    0.999 CPUs utilized
             3      context-switches                 #    4.278 /sec
             0      cpu-migrations                   #    0.000 /sec
            71      page-faults                      #  101.241 /sec
 2,931,696,392      cycles                           #    4.180 GHz
 6,566,898,298      instructions                     #    2.24  insn per cycle
11,249,046,508      uops_issued.any                  #   16.040 G/sec
11,019,891,003      uops_executed.thread             #   15.714 G/sec
     3,153,961      idq.mite_uops                    #    4.497 M/sec
 5,035,000,000      fp_arith_inst_retired.256b_packed_double #    7.180 G/sec

   0.701728321 seconds time elapsed

   0.701217000 seconds user
   0.000000000 seconds sys

There's more cleanup work after the loop, 7x vaddpd to get down to 1 vector. And instead of another 2x speedup, we bottleneck at 16.040 G uops issued / 4.180 GHz ~= 3.87 average uops issued per clock, with most cycles issuing Skylake's max of 4. That's because clang/LLVM doesn't know how to tune for Intel CPUs with respect to indexed addressing modes. (uops executed is actually lower than uops issued, so very few of the loads stayed micro-fused with an ALU uop, and the 8x vxorps zeroing of 8 vectors before each repeat-loop iteration needs an issue slot but no back-end execution unit.)

7.180 G/sec / 4.18 GHz = an average of 1.71 256-bit FP instructions executed per clock cycle.

(The CPU probably ran at 4.20 GHz the whole time, but that frequency is derived from the cycle count (user-space only) divided by task-clock. Time spent in the kernel, on page faults and interrupts, isn't counted, since we used perf stat --all-user.)


Update:

Fixing the front-end bottleneck by avoiding indexed addressing modes gets us from 1.71 up to 1.81 vaddpd per clock. (Not 2.0, because imperfect uop scheduling loses a cycle now and then, with no slack to catch up.) That's about 30281.47 MFLOP/s on a single core at 4.2 GHz.

As a starting point, I used clang -O3 -fno-unroll-loops -S -march=native -ffast-math -Wall arr-sum.c -masm=intel -o arr-sum-asm.S on the C version with the unroll pragma, so that loop is still unrolled with 8 accumulators, just by 8 instead of 16.

The outer REPEAT loop stays rolled up, so I only had to hand-edit one copy of the asm loop (inlined into main). The ds prefixes on a few instructions are there to work around the JCC erratum. Note that none of the instructions need a disp32 addressing mode, because I placed the pointer increment where the offsets can benefit from the full -0x80 .. +0x7f disp8 range (actually -0x80 to +0x60 here). So the machine code is much smaller than clang's, with instruction lengths of 5 (or 4 for [rdi+0]). The add does end up needing an imm32, but there's only one of it. Crucially, the memory operands stay micro-fused, cutting front-end uop bandwidth nearly in half.

    vxorpd  xmm0, xmm0, xmm0
 ...
    vxorpd  xmm7, xmm7, xmm7    # compiler-generated sumvec = 0
    mov     ecx, 4000 / (4*8)   # loop trip-count
    mov    rdi, rbx             # startp = arr
    .p2align        4, 0x90
.LBB2_7:                #   Parent Loop BB2_5 Depth=1
                        #     Parent Loop BB2_6 Depth=2
                        # =>    This Inner Loop Header: Depth=3
    ds vaddpd ymm0, ymm0, [rdi + 32*0]
    vaddpd  ymm1, ymm1, [rdi + 32*1]
    vaddpd  ymm2, ymm2, [rdi + 32*2]
    ds vaddpd   ymm5, ymm5, [rdi + 32*3]
    add   rdi, 256
    vaddpd  ymm3, ymm3, [rdi - 32*4]
    ds vaddpd   ymm6, ymm6, [rdi - 32*3]
    vaddpd  ymm7, ymm7, [rdi - 32*2]
    vaddpd  ymm4, ymm4, [rdi - 32*1]
    dec     rcx           # not spanning a 32B boundary
    jne     .LBB2_7
# %bb.8:                                #   in Loop: Header=BB2_6 Depth=2
    vaddpd  ymm0, ymm1, ymm0
    vaddpd  ymm1, ymm5, ymm2
    ... hsum
$ taskset -c 1  perf stat  -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_arith_inst_retired.256b_packed_double   -r1 ./a.out 
30281.47 MFLOPS (2003.09)
30057.33 MFLOPS (2003.09)
30138.64 MFLOPS (2003.09)
30160.00 MFLOPS (2003.09)
29979.61 MFLOPS (2003.09)

 Performance counter stats for './a.out':

            664.79 msec task-clock                       #    0.999 CPUs utilized             
                 3      context-switches                 #    4.513 /sec
                 0      cpu-migrations                   #    0.000 /sec
                73      page-faults                      #  109.809 /sec
     2,775,830,392      cycles                           #    4.176 GHz
     7,007,878,485      instructions                     #    2.52  insn per cycle
     6,457,154,731      uops_issued.any                  #    9.713 G/sec
    11,378,180,211      uops_executed.thread             #   17.115 G/sec
         3,634,644      idq.mite_uops                    #    5.467 M/sec
     5,035,000,000      fp_arith_inst_retired.256b_packed_double #    7.574 G/sec

       0.665220579 seconds time elapsed

       0.664698000 seconds user
       0.000000000 seconds sys

uops_issued.any is now about 2.32 per cycle, with plenty of headroom for uop-cache fetch and other front-end bottlenecks.

Unrolling by 10 instead of 8 only gives a tiny speedup, like 662.49 ms total time on a good run, with a best of 30420.80 MFLOPS. IPC around 2.41.

L1d cache bandwidth is the final bottleneck on SKL before saturating the FP pipes: it can't quite sustain 2x 32-byte loads per clock. Changing 3 of the insns to add a register to itself (sum7 += sum7) speeds the 10-accumulator version up to 619.11 ms total time, a best of 32424.90 MFLOPS, 2.58 IPC, and an average of 1.957 256-bit uops per clock. (During the cleanup, the FP ports also have to compete with a few 128-bit adds.)

Raptor Lake can do 3 loads per clock even for vectors, so that shouldn't be a problem there.
