I'm trying to write a C program that sums an array of doubles as fast as numpy.sum, but I seem to be failing.

Here is how I measure numpy's performance:

import numpy as np
import time

SIZE=4000000
REPS=5

xs = np.random.rand(SIZE)
print(xs.dtype)

for _ in range(REPS):
    start = time.perf_counter()
    r = np.sum(xs)
    end = time.perf_counter()
    print(f"{SIZE / (end-start) / 10**6:.2f} MFLOPS ({r:.2f})")

The output is:

float64
2941.61 MFLOPS (2000279.78)
3083.56 MFLOPS (2000279.78)
3406.18 MFLOPS (2000279.78)
3712.33 MFLOPS (2000279.78)
3661.15 MFLOPS (2000279.78)

Now trying to do something similar in C:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 4000000
#define REPS 5

double *make_random_array(long array_size) {
  double *array = malloc(array_size * sizeof(double));
  if (array == NULL)
    return NULL;
  srand(0);
  for (size_t i = 0; i < array_size; ++i) {
    array[i] = (double)rand() / RAND_MAX;
  }
  return array;
}

double sum_array(const double *array, long size) {
  double sum = 0.0;
  for (size_t i = 0; i < size; ++i) {
    sum += array[i];
  }
  return sum;
}

int main() {
  double *xs = make_random_array(SIZE);
  if (xs == NULL) return 1;

  for (int i = 0; i < REPS; i++) {
    clock_t start_time = clock();
    double r = sum_array(xs, SIZE);
    clock_t end_time = clock();
    double dt = (double)(end_time - start_time) / CLOCKS_PER_SEC;
    printf("%.2f MFLOPS (%.2f)\n", (double)SIZE / dt / 1000000, r);
  }

  free(xs);
  return 0;
}

Compiling with gcc -o main -Wall -O3 -mavx main.c and running it, the output is:

1850.14 MFLOPS (1999882.86)
1857.01 MFLOPS (1999882.86)
1900.24 MFLOPS (1999882.86)
1903.86 MFLOPS (1999882.86)
1906.58 MFLOPS (1999882.86)

This is clearly much slower than numpy.

According to top, the CPU usage of the python process is around 100%, so it doesn't look like NumPy is parallelizing anything.

The C code seems to use 256-bit AVX registers (when compiling with -S there are vaddsd instructions on xmm0). That seems like the best available choice, since the machine I'm using doesn't appear to support AVX-512:

$ egrep 'model name|flags' /proc/cpuinfo  | head -n2
model name      : 13th Gen Intel(R) Core(TM) i9-13900K
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities

What trick is NumPy using to beat this C code?

Recommended answer

Your loop doesn't auto-vectorize at all, because strict FP math is the default. XMM0 is a 128-bit register; YMM0 is the corresponding 256-bit register. vaddsd is ADD Scalar Double, using only the low element of XMM0. https://felixcloutier.com/x86/addsd

Use clang -O3 -ffast-math -march=native to have it vectorize and unroll (by 4) for a 16x speedup: 4x each from AVX and from instruction-level parallelism (wikipedia / Modern Microprocessors A 90-Minute Guide!), with the array small enough not to bottleneck on L3 cache bandwidth. (Another roughly 2x is available for arrays that fit in L1d, not just L2, e.g. with #pragma clang loop interleave_count(8) to unroll even more, since code you've cache-blocked will usually be getting hits in L1d cache.)

Your Raptor Lake CPU has two fully-pipelined vector-FP add units, with a pipeline length of 3 (cycles of latency before a result is ready as input to another add). This answer includes my results on an i7-6700k Skylake, which is the same except that its FP-add pipeline has 4-cycle latency.

@Jérôme Richard comments that NumPy just does scalar pairwise summation for the sum of an FP array, which gains some ILP over a purely naive serial loop. That's fine if you're going to bottleneck on DRAM bandwidth anyway. One benefit is numerical consistency across ISAs and available SIMD features, achieved by not using them.
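For illustration, a minimal sketch of that pairwise-summation idea (just the recursive-halving shape, not NumPy's actual implementation, which uses a larger unrolled base case):

// Sketch of scalar pairwise summation. Splitting the array recursively keeps
// rounding-error growth around O(log n) and gives some ILP, since the two
// halves are independent dependency chains.
static double pairwise_sum(const double *a, long n) {
  if (n <= 8) {                       // small base case: plain serial sum
    double s = 0.0;
    for (long i = 0; i < n; ++i)
      s += a[i];
    return s;
  }
  long half = n / 2;
  return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}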


You're looking for vaddpd ymm0, ymm0, [rdi] (packed double on a 256-bit vector). GCC will do that with -ffast-math, which lets it pretend FP math is associative, changing the rounding error. (For the better in this case, e.g. if you compare against a long double sum or a Kahan error-compensated sum; it's a step in the same direction as pairwise summation.) See also https://gcc.gnu.org/wiki/FloatingPointMath

gcc -O3 -march=native -ffast-math  foo.c

That gives about a 4x speedup, since FP-ALU latency (1 vector instead of 1 scalar per 3 cycles on your CPU) is still a worse bottleneck than L3 bandwidth, and definitely worse than L2 cache bandwidth.


SIZE=4000000 times sizeof(double) is 30.52 MiB, so it will fit in the 36 MiB L3 cache of your high-end Raptor Lake. But to go faster you'll want to shrink SIZE and increase REPS (and maybe put a repeat loop inside each timed region). The whole program is so short that running it under perf stat on my i7-6700k Skylake with DDR4-2666 takes well under a second, much of which is startup. And each timed region is fairly short for timing with clock() instead of clock_gettime.

Your per-core cache sizes are 48 KiB L1d and 2 MiB L2 (on the Golden Cove P-cores; less/more on a single Gracemont E-core). https://en.wikipedia.org/wiki/Raptor_Lake / https://chipsandcheese.com/2022/08/23/a-preview-of-raptor-lakes-improved-l2-caches/. SIZE=6144 would make the array the full size of L1d; SIZE=5120 if we aim for only 40 KiB for the array, leaving room for other stuff. It's also best to align it by 32 bytes with aligned_alloc, so we can read it from L1d cache at 3 vectors (96 bytes) per clock cycle instead of getting a cache-line split every other vector. (https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/ / https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-throughput-limit / https://uops.info/)
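A minimal allocation sketch under those assumptions (sizes picked so the byte count is a multiple of the 32-byte alignment, which C11's aligned_alloc requires); the updated source further down does the same thing:

#include <stdlib.h>

#define SIZE 5120   // 40 KiB of doubles: fits in Raptor Lake's 48 KiB L1d with room to spare

double *make_aligned_array(long n) {
  // 32-byte alignment so every 256-bit load comes from a single cache line.
  // n * sizeof(double) must be a multiple of 32 here (n a multiple of 4).
  double *a = aligned_alloc(32, n * sizeof(double));
  return a;   // caller checks for NULL and frees
}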

To get anywhere near peak FLOPS (within a factor of 2 since we're not using FMA), we need to run 2 vaddpd instructions every clock cycle. But it has a latency of 3 cycles on your Golden Cove P-cores (Alder/Raptor Lake), so the latency * bandwidth product is 6 vaddpd in flight at once. That's the minimum number of dependency chains, preferably at least 8. Anything less will leave loop-carried dependency chains as the bottleneck, not throughput. (Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators))
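At the source level, that means multiple independent accumulators so several add chains are in flight at once. A scalar sketch of the idea (8 accumulators per the analysis above, which a compiler can then widen to YMM vectors with -ffast-math -march=native; size is assumed to be a multiple of 8 just to keep the sketch short):

// Sketch: 8 independent accumulators = 8 add dependency chains in flight.
// A real version needs a cleanup loop for the size % 8 leftover elements.
double sum_array_unrolled(const double *array, long size) {
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
  for (long i = 0; i < size; i += 8) {
    s0 += array[i + 0];  s1 += array[i + 1];
    s2 += array[i + 2];  s3 += array[i + 3];
    s4 += array[i + 4];  s5 += array[i + 5];
    s6 += array[i + 6];  s7 += array[i + 7];
  }
  return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}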

So you're looking for extra instructions in the inner loop, like vaddpd ymm1, ymm1, [rdi+32]. Golden Cove's 3c latency / 0.5c reciprocal-throughput vaddps/pd is thanks to a dedicated SIMD-FP add ALU, rather than the 4-cycle pipeline of the MUL/FMA execution units that had also handled add/sub since Skylake. Unlike Haswell, Golden Cove (Alder/Raptor Lake P-cores) has two such ALUs, so add throughput is still as good as with FMAs.

GCC's -funroll-loops is no use here: it unrolls the loop but still with only one accumulator vector. (Even with #pragma omp simd reduction (+:sum) and -fopenmp.) Clang will unroll with 4 accumulators by default. With -march=raptorlake it will unroll by 20, but still with only 4 accumulators, so 5 adds go into each vector. It also uses indexed addressing modes like [rdi + 8*rcx + 32], so each vaddpd ymm, ymm, [reg+reg*8] un-laminates into 2 uops instead of keeping front-end cost as low as possible. There's only one array involved, so a pointer increment instead of an index wouldn't even cost anything extra; it doesn't do anything clever like indexing relative to the end of the array with a negative index counting up toward zero. But that isn't the bottleneck here: Golden Cove's wide front-end (6 uops) can issue 3 of these vaddpd [mem+idx] instructions per cycle, staying ahead of the back-end (2/clock). Even 4-wide Skylake can keep up with this limited unroll.

#pragma clang loop interleave_count(8) before the loop gets clang to unroll with more accumulators. (For more than 8 it ignores the pragma and just does 4 :/) That's probably only a good idea for code you expect to get L1d hits; the default is fine if you expect your array to be coming from L2 or farther away. Of course, the non-interleaved part of the unroll is also a waste of code size in that case, and costs extra cleanup code if n isn't a compile-time constant. Docs: https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

By default, with no pragma but with -O3 -ffast-math -march=native (plus -mbranches-within-32B-boundaries on Skylake), we get the same unroll by 20 with 4 interleaved accumulators that clang uses for Raptor Lake. (It also fully unrolls the REPS timing/printing loop, so this big loop is repeated 5 times. That's almost certainly worse than spending 1 register and a couple of instructions to re-run code that's already hot in cache.)

# clang 16  no pragma, unrolls by 20 with 4 accumulators
inner_loop_top:
    1360:       c5 fd 58 84 cb a0 fd ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0x260]
    1369:       c5 f5 58 8c cb c0 fd ff ff      vaddpd ymm1,ymm1,[rbx+rcx*8-0x240]
    1372:       c5 ed 58 94 cb e0 fd ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0x220]
    137b:       c5 e5 58 9c cb 00 fe ff ff      vaddpd ymm3,ymm3, [rbx+rcx*8-0x200]
    1384:       c5 fd 58 84 cb 20 fe ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0x1e0]
  ... ymm1, ymm2
    139f:       c5 e5 58 9c cb 80 fe ff ff      vaddpd ymm3,ymm3,[rbx+rcx*8-0x180]

... 2 more copies of ymm0..3, ending with the next insn, the first to use a 1-byte disp8
    13e7:       c5 e5 58 5c cb 80       vaddpd ymm3,ymm3, [rbx+rcx*8-0x80]

    13ed:       c5 fd 58 44 cb a0       vaddpd ymm0,ymm0, [rbx+rcx*8-0x60]
    13f3:       c5 f5 58 4c cb c0       vaddpd ymm1,ymm1, [rbx+rcx*8-0x40]
    13f9:       c5 ed 58 54 cb e0       vaddpd ymm2,ymm2, [rbx+rcx*8-0x20]
    13ff:       c5 e5 58 1c cb          vaddpd ymm3,ymm3, [rbx+rcx*8]
    1404:       48 83 c1 50             add    rcx,0x50
    1408:       48 81 f9 ec 0f 00 00    cmp    rcx,0xfec
    140f:       0f 85 4b ff ff ff       jne    1360 <main+0x80>

Compared with that, with the pragma, when inlined into main it unrolls by 16 with 8 accumulators. 4000 is not a multiple of 16x4, so the loop-exit condition sits in the middle of the loop, between two groups of 8 adds.

# clang 16  with pragma, unrolls by 16 with 8 accumulators
inner_loop_top:
    13f0:       c5 fd 58 84 cb 20 fe ff ff      vaddpd ymm0,ymm0,[rbx+rcx*8-0x1e0]
    13f9:       c5 f5 58 8c cb 40 fe ff ff      vaddpd ymm1,ymm1,[rbx+rcx*8-0x1c0]
    1402:       c5 ed 58 94 cb 60 fe ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0x1a0]
    140b:       c5 e5 58 9c cb 80 fe ff ff      vaddpd ymm3,ymm3, [rbx+rcx*8-0x180]
    1414:       c5 dd 58 a4 cb a0 fe ff ff      vaddpd ymm4,ymm4,[rbx+rcx*8-0x160]
    141d:       c5 d5 58 ac cb c0 fe ff ff      vaddpd ymm5,ymm5, [rbx+rcx*8-0x140]
    1426:       c5 cd 58 b4 cb e0 fe ff ff      vaddpd ymm6,ymm6,[rbx+rcx*8-0x120]
    142f:       c5 c5 58 bc cb 00 ff ff ff      vaddpd ymm7,ymm7, [rbx+rcx*8-0x100]
    1438:       0f 1f 84 00 00 00 00 00         nop    DWORD PTR [rax+rax*1+0x0]       # JCC erratum workaround
    1440:       48 81 f9 bc 0f 00 00    cmp    rcx,0xfbc
    1447:       0f 84 33 ff ff ff       je     1380 <main+0x60>
    144d:       c5 fd 58 84 cb 20 ff ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0xe0]
    1456:       c5 f5 58 8c cb 40 ff ff ff      vaddpd ymm1,ymm1, [rbx+rcx*8-0xc0]
    145f:       c5 ed 58 94 cb 60 ff ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0xa0]
    1468:       c5 e5 58 5c cb 80       vaddpd ymm3,ymm3, [rbx+rcx*8-0x80]
    146e:       c5 dd 58 64 cb a0       vaddpd ymm4,ymm4, [rbx+rcx*8-0x60]
    1474:       c5 d5 58 6c cb c0       vaddpd ymm5,ymm5, [rbx+rcx*8-0x40]
    147a:       c5 cd 58 74 cb e0       vaddpd ymm6,ymm6, [rbx+rcx*8-0x20]
    1480:       c5 c5 58 3c cb          vaddpd ymm7,ymm7, [rbx+rcx*8]
    1485:       48 83 c1 40             add    rcx,0x40
    1489:       e9 62 ff ff ff          jmp    13f0 <main+0xd0>

I tried changing the source to encourage the compiler to increment a pointer, but clang didn't take the hint, instead inventing an index counter in a register and using addressing modes like [rdi + r8*8 + 0x20]:

  const double * endp = array+size;
#pragma clang loop interleave_count(8)
  while (array != endp) {  // like a C++ range-for
    sum += *array++;       // no benefit, clang pessimizes back to an index
  }

Updated microbenchmark source

// #define SIZE 5120 // 40 KiB, fits in Raptor Lake's 48KiB
#define SIZE 4000     // fits in SKL's 32KiB L1d cache
#define REPS 5

...

        double *array = aligned_alloc(32, array_size * sizeof(double));
//  double *array = malloc(array_size * sizeof(double));

...

double sum_array(const double *array, long size) {
  double sum = 0.0;
//#pragma clang loop interleave_count(8)   // uncomment this, optionally
  for (size_t i = 0; i < size; ++i) {
    sum += array[i];
  }
  return sum;
}


int main() {
  double *xs = make_random_array(SIZE);
  if (xs == NULL) return 1;

  const int  inner_reps = 1000000;  // sum the array this many times each timed interval
  for (int i = 0; i < REPS; i++) {
    clock_t start_time = clock();
    volatile double r;  // do something with the sum even when we don't print
    for (int i=0 ; i<inner_reps ; i++){  // new inner loop
       r = sum_array(xs, SIZE);
       //  asm(""::"r"(xs) :"memory");  // forget about the array contents and redo the sum
       // turned out not to be necessary, clang is still doing the work
    }
    clock_t end_time = clock();
    double dt = (double)(end_time - start_time) / (CLOCKS_PER_SEC * inner_reps);
    printf("%.2f MFLOPS (%.2f)\n", (double)SIZE / dt / 1000000, r);
  }

  free(xs);
  return 0;
}

With const int inner_reps = 1000000; added as a repeat count of array sums per timed interval, plus some steps to make sure the optimizer doesn't defeat the benchmark (Godbolt - also shrinking SIZE to 4000 to fit in my 32 KiB L1d), on my Skylake at 4.2 GHz I get the expected 16x speedup.

GCC 13.2.1 and clang 16.0.6 on Arch GNU/Linux, kernel 6.5.

# Without any vectorization
$ gcc -O3 -march=native -Wall arr-sum.c
$ taskset -c 1 perf stat  -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_arith_inst_retired.256b_packed_single   -r1 ./a.out
1057.69 MFLOPS (2003.09)
1059.17 MFLOPS (2003.09)
1059.67 MFLOPS (2003.09)
1060.30 MFLOPS (2003.09)
1060.34 MFLOPS (2003.09)
... perf results below

# with 1 vector accumulator
$ gcc -O3 -march=native -ffast-math -Wall arr-sum.c
$ taskset -c 1 perf stat ... a.out
4389.68 MFLOPS (2003.09)
4389.32 MFLOPS (2003.09)
4381.48 MFLOPS (2003.09)
4393.57 MFLOPS (2003.09)
4389.98 MFLOPS (2003.09)
... perf results below

# unrolled by 4 vectors
$ clang -O3 -march=native -ffast-math -Wall arr-sum.c   # clang unrolls by default
$ taskset -c 1 perf stat ... a.out
17048.41 MFLOPS (2003.09)
17072.49 MFLOPS (2003.09)
17060.55 MFLOPS (2003.09)
17081.02 MFLOPS (2003.09)
17099.79 MFLOPS (2003.09)
... perf results below, but including:
     2,303,995,395      idq.mite_uops                    #    1.965 G/sec                     
  # suffering from the JCC erratum in the inner loop; avoid it:

$ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c
$ taskset -c 1 perf stat ... a.out
17013.53 MFLOPS (2003.09)
17061.79 MFLOPS (2003.09)
17064.99 MFLOPS (2003.09)
17109.44 MFLOPS (2003.09)
17001.74 MFLOPS (2003.09)
... perf results below; summary: 1.17 seconds
     4,905,130,231      cycles                           #    4.178 GHz                       
     5,941,872,098      instructions                     #    1.21  insn per cycle
         5,165,165      idq.mite_uops                    #    4.399 M/sec
     5,015,000,000      fp_arith_inst_retired.256b_packed_double #    4.271 G/sec

 # With  #pragma clang loop interleave_count(8) in the source
 # for unrolling by 8 instead of 4
$ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c
$ taskset -c 1 perf stat ... a.out
28505.05 MFLOPS (2003.09)
28553.48 MFLOPS (2003.09)
28556.13 MFLOPS (2003.09)
28597.37 MFLOPS (2003.09)
28548.18 MFLOPS (2003.09)
 # imperfect scheduling and a front-end bottleneck from clang's bad choice of addressing-mode
 # means we don't get another 2x over the default.

(With perf stat -d I also confirmed the L1d cache miss rate was under 1%. With a larger array size like 20000, which fits in Skylake's 256K L2 cache but not in L1d, I still got throughput fairly close to 1 vector per clock.)

The JCC erratum workaround (Skylake-family only, not your CPU) gives only a trivial further speedup in this case; the front-end wasn't the bottleneck even with legacy decode, since un-lamination happens after decode and the decoders don't get stuck handling 2-uop instructions. And uops_issued.any still averaged only 2.18/clock (with the unroll by 4).

So we get a factor of 16 speedup on Skylake from vectorizing with AVX (4x) and instruction-level parallelism of 4 accumulators. That's still only slightly better than 1 vaddpd per clock cycle on average (thanks to ILP across repeat-loop iterations), but clang's 4 dep chains are only half of Skylake's 4-cycle latency x 2 insn/cycle throughput = 8 FP math instructions in flight.

Unrolling by only 4 leaves another factor of 2 of performance on the table (for Skylake; less for Alder Lake and later. Update: we got most of that with the pragma). But that's only achievable with data hot in L1d cache, with careful cache-blocking, or if you're doing more work with the data while it's in registers (higher computational intensity, not just 1 add per load). Getting another full 2x would also require an optimizer that knows about Sandybridge-family un-lamination, which clang's apparently doesn't. Clang's default of 4 accumulators seems reasonable; more accumulators would mean more init and cleanup work, though unrolling by 20 with only 4 accumulators seems like too much, a waste of I-cache / uop-cache footprint.
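As a sketch of what that would look like written by hand: AVX intrinsics with 8 vector accumulators and a plain pointer increment, so each vaddpd can in principle use a simple [reg+disp8] memory operand and stay micro-fused. (My own naming, not code from the answer; it assumes a 32-byte-aligned array whose length is a multiple of 32 doubles. hsum256d is sketched further down, next to the cleanup discussion.)

#include <immintrin.h>

double hsum256d(__m256d v);   // horizontal-sum helper, sketched below

// Sketch: 8 YMM accumulators, non-indexed addressing via a pointer increment.
// Assumes `array` is 32-byte aligned and `size` is a multiple of 32.
double sum_array_avx(const double *array, long size) {
  __m256d acc[8];
  for (int k = 0; k < 8; ++k)
    acc[k] = _mm256_setzero_pd();

  for (const double *end = array + size; array != end; array += 32) {
    for (int k = 0; k < 8; ++k)                 // fully unrolled by the compiler
      acc[k] = _mm256_add_pd(acc[k], _mm256_load_pd(array + 4 * k));
  }
  // Pairwise-reduce the 8 accumulators down to 1 vector, then to scalar.
  __m256d v0123 = _mm256_add_pd(_mm256_add_pd(acc[0], acc[1]),
                                _mm256_add_pd(acc[2], acc[3]));
  __m256d v4567 = _mm256_add_pd(_mm256_add_pd(acc[4], acc[5]),
                                _mm256_add_pd(acc[6], acc[7]));
  return hsum256d(_mm256_add_pd(v0123, v4567));
}

Build with -O3 -mavx (or -march=native). Note that, as seen above with the scalar pointer-increment attempt, clang may still re-introduce an index register, so checking the generated asm is worthwhile.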


Performance counter results

User-space-only counts, on i7-6700k Skylake (EPP=performance) with Linux kernel & perf 6.5. This is for the whole process including startup, but the inner repeat count of 1 million means the vast majority of the total time is spent in the loop we care about, not in startup.

Scalar loop:
Performance counter stats for './a.out' (GCC O3-native without fast-math):

     18,902.70 msec task-clock                       #    1.000 CPUs utilized
            54      context-switches                 #    2.857 /sec
             0      cpu-migrations                   #    0.000 /sec
            72      page-faults                      #    3.809 /sec
79,099,401,032      cycles                           #    4.185 GHz
35,069,666,963      instructions                     #    0.44  insn per cycle
30,109,096,046      uops_issued.any                  #    1.593 G/sec
50,096,899,159      uops_executed.thread             #    2.650 G/sec
    46,353,551      idq.mite_uops                    #    2.452 M/sec
             0      fp_arith_inst_retired.256b_packed_double #    0.000 /sec

  18.902876984 seconds time elapsed

  18.893778000 seconds user
   0.000000000 seconds sys

Note the count of 0 for fp_arith_inst_retired.256b_packed_double - no 256-bit SIMD instructions at all.

Vectorized but not unrolled:
Performance counter stats for './a.out' (GCC O3-native-fast-math):

      4,559.54 msec task-clock                       #    1.000 CPUs utilized
             8      context-switches                 #    1.755 /sec
             0      cpu-migrations                   #    0.000 /sec
            74      page-faults                      #   16.230 /sec
19,093,881,407      cycles                           #    4.188 GHz
20,060,557,627      instructions                     #    1.05  insn per cycle
15,094,070,341      uops_issued.any                  #    3.310 G/sec
20,075,885,996      uops_executed.thread             #    4.403 G/sec
    12,015,692      idq.mite_uops                    #    2.635 M/sec
 5,000,000,000      fp_arith_inst_retired.256b_packed_double #    1.097 G/sec

   4.559770793 seconds time elapsed

   4.557838000 seconds user
   0.000000000 seconds sys

Vectorized, unrolled by 20 with 4 accumulators:
Performance counter stats for './a.out': (Clang -O3-native-fast-math JCC-workaround)

      1,174.07 msec task-clock                       #    1.000 CPUs utilized
             5      context-switches                 #    4.259 /sec
             0      cpu-migrations                   #    0.000 /sec
            72      page-faults                      #   61.325 /sec
 4,905,130,231      cycles                           #    4.178 GHz
 5,941,872,098      instructions                     #    1.21  insn per cycle
10,689,939,125      uops_issued.any                  #    9.105 G/sec
10,566,645,887      uops_executed.thread             #    9.000 G/sec
     5,165,165      idq.mite_uops                    #    4.399 M/sec
 5,015,000,000      fp_arith_inst_retired.256b_packed_double #    4.271 G/sec

   1.174507232 seconds time elapsed

   1.173769000 seconds user
   0.000000000 seconds sys

Note the slightly-more-than-5-billion count of 256-bit vector instructions: that's the reduction of 4 accumulators down to 1 (3x vaddpd) before the horizontal sum down to 1 scalar. (The hsum starts with vextractf128 of the high half and then uses 128-bit vector instructions, so this counter doesn't include them, but they still compete with the work of the next iteration that's starting.)
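For reference, the usual shape of that horizontal sum written as intrinsics (this is the helper assumed in the sketch above; the name is mine):

#include <immintrin.h>

// Extract the high 128 bits (vextractf128), add to the low half, then add the
// two remaining doubles. The 128-bit ops don't count as 256b_packed_double.
double hsum256d(__m256d v) {
  __m128d lo   = _mm256_castpd256_pd128(v);     // low half, no instruction needed
  __m128d hi   = _mm256_extractf128_pd(v, 1);   // vextractf128 of the high half
  __m128d sum2 = _mm_add_pd(lo, hi);            // 2 doubles left
  __m128d swap = _mm_unpackhi_pd(sum2, sum2);   // bring the upper double down
  return _mm_cvtsd_f64(_mm_add_sd(sum2, swap)); // final scalar add
}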

Vectorized, unrolled by 16 with 8 accumulators:
Performance counter stats for './a.out' (clang -O3 native fast-math #pragma ... interleave_count(8)):

        701.30 msec task-clock                       #    0.999 CPUs utilized
             3      context-switches                 #    4.278 /sec
             0      cpu-migrations                   #    0.000 /sec
            71      page-faults                      #  101.241 /sec
 2,931,696,392      cycles                           #    4.180 GHz
 6,566,898,298      instructions                     #    2.24  insn per cycle
11,249,046,508      uops_issued.any                  #   16.040 G/sec
11,019,891,003      uops_executed.thread             #   15.714 G/sec
     3,153,961      idq.mite_uops                    #    4.497 M/sec
 5,035,000,000      fp_arith_inst_retired.256b_packed_double #    7.180 G/sec

   0.701728321 seconds time elapsed

   0.701217000 seconds user
   0.000000000 seconds sys

There's more cleanup work after the loop, 7x vaddpd to get down to 1 vector. And instead of another 2x speedup, we bottleneck at 16.040 G uops issued / 4.180 GHz ~= 3.87 average uops issued per clock, with most cycles issuing Skylake's max of 4. That's because clang/LLVM doesn't know how to tune for Intel CPUs with respect to indexed addressing modes. (uops executed is actually lower than uops issued, so very few of the loads stayed micro-fused with an ALU uop, and the 8x vxorps zeroing of 8 vectors before each repeat-loop iteration needs an issue slot but no back-end execution unit.)

7.180 G/sec / 4.18 GHz = an average of 1.71 256-bit FP instructions executed per clock cycle.

(The CPU probably ran at 4.20 GHz the whole time, but that frequency is derived from the cycle count (user-space only) divided by task-clock. Time spent in the kernel, on page faults and interrupts, isn't counted, since we used perf stat --all-user.)


Update:

Fixing the front-end bottleneck by avoiding indexed addressing modes gets us from 1.71 up to 1.81 vaddpd per clock. (Not 2.0, because imperfect uop scheduling loses a cycle now and then, with no slack to catch up.) That's about 30281.47 MFLOP/s on a single core at 4.2 GHz.

As a starting point, I used clang -O3 -fno-unroll-loops -S -march=native -ffast-math -Wall arr-sum.c -masm=intel -o arr-sum-asm.S on the C version with the unroll pragma, so that loop is still unrolled with 8 accumulators, just by 8 instead of 16.

The outer REPEAT loop stays rolled up, so I only had to hand-edit one copy of the asm loop (inlined into main). The ds prefixes on a few instructions are there to work around the JCC erratum. Note that none of the instructions need a disp32 addressing mode, because I placed the pointer increment where the offsets can benefit from the full -0x80 .. +0x7f disp8 range (actually -0x80 to +0x60 here). So the machine code is much smaller than clang's, with instruction lengths of 5 (or 4 for [rdi+0]). The add does end up needing an imm32, but there's only one of it. Crucially, the memory operands stay micro-fused, cutting front-end uop bandwidth nearly in half.

    vxorpd  xmm0, xmm0, xmm0
 ...
    vxorpd  xmm7, xmm7, xmm7    # compiler-generated sumvec = 0
    mov     ecx, 4000 / (4*8)   # loop trip-count
    mov    rdi, rbx             # startp = arr
    .p2align        4, 0x90
.LBB2_7:                #   Parent Loop BB2_5 Depth=1
                        #     Parent Loop BB2_6 Depth=2
                        # =>    This Inner Loop Header: Depth=3
    ds vaddpd ymm0, ymm0, [rdi + 32*0]
    vaddpd  ymm1, ymm1, [rdi + 32*1]
    vaddpd  ymm2, ymm2, [rdi + 32*2]
    ds vaddpd   ymm5, ymm5, [rdi + 32*3]
    add   rdi, 256
    vaddpd  ymm3, ymm3, [rdi - 32*4]
    ds vaddpd   ymm6, ymm6, [rdi - 32*3]
    vaddpd  ymm7, ymm7, [rdi - 32*2]
    vaddpd  ymm4, ymm4, [rdi - 32*1]
    dec     rcx           # not spanning a 32B boundary
    jne     .LBB2_7
# %bb.8:                                #   in Loop: Header=BB2_6 Depth=2
    vaddpd  ymm0, ymm1, ymm0
    vaddpd  ymm1, ymm5, ymm2
    ... hsum
$ taskset -c 1  perf stat  -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_arith_inst_retired.256b_packed_double   -r1 ./a.out 
30281.47 MFLOPS (2003.09)
30057.33 MFLOPS (2003.09)
30138.64 MFLOPS (2003.09)
30160.00 MFLOPS (2003.09)
29979.61 MFLOPS (2003.09)

 Performance counter stats for './a.out':

            664.79 msec task-clock                       #    0.999 CPUs utilized             
                 3      context-switches                 #    4.513 /sec
                 0      cpu-migrations                   #    0.000 /sec
                73      page-faults                      #  109.809 /sec
     2,775,830,392      cycles                           #    4.176 GHz
     7,007,878,485      instructions                     #    2.52  insn per cycle
     6,457,154,731      uops_issued.any                  #    9.713 G/sec
    11,378,180,211      uops_executed.thread             #   17.115 G/sec
         3,634,644      idq.mite_uops                    #    5.467 M/sec
     5,035,000,000      fp_arith_inst_retired.256b_packed_double #    7.574 G/sec

       0.665220579 seconds time elapsed

       0.664698000 seconds user
       0.000000000 seconds sys

uops_issued.any is now about 2.32 per cycle, with plenty of headroom for uop-cache fetch and other front-end bottlenecks.

Unrolling by 10 instead of 8 only gives a tiny speedup, like 662.49 ms total time on a good run, with a best of 30420.80 MFLOPS. IPC around 2.41.

L1d cache bandwidth is the final bottleneck on SKL before saturating the FP pipes: it can't quite sustain 2x 32-byte loads per clock. Changing 3 of the insns to add a register to itself (sum7 += sum7) speeds the 10-accumulator version up to 619.11 ms total time, a best of 32424.90 MFLOPS, 2.58 IPC, and an average of 1.957 256-bit uops per clock. (During the cleanup, the FP ports also have to compete with a few 128-bit adds.)

Raptor Lake can do 3 loads per clock even for vectors, so that shouldn't be a problem there.
