介绍

我有一个相当简单的Cython模块--为了进行特定的测试,我进一步简化了它.

这个模块只有一个类,它只有一个有趣的方法(名为run):这个方法接受一个Fortran排序的2D NumPy数组和两个1D NumPy数组作为输入,并在这些数组上做一些非常非常简单的事情(参见下面的代码).

为了进行基准测试,我在Windows 10 64位、Python3.9.10 64位、Cython0.29.32上与MSVC、GCC 8、GCC 11、GCC 12和GCC 13编译了完全相同的模块.我从优秀的Brecht Sanders GitHub页面(https://github.com/brechtsanders)获得的所有GCC编译器.

主要问题

这篇很长的帖子的首要问题是:我只是好奇,是否有人能解释为什么GCC12和GCC13比GCC11慢得多(GCC11是最快的).看起来GCC的每一期演出都在下降,而不是变得更好……

基准

在基准测试中,我简单地改变了2D和1D数组的数组维度(mn)以及2D和1D数组中非零条目的数量.我 for each 编译器版本重复了20次run方法,每组mn条目以及非零条目.

我正在使用的优化设置:

MVSC

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']

GCC

GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS

我观察到的是以下几点:

  • 到目前为止,MSVC在执行基准测试方面是最慢的(为什么会在Windows上?)
  • 由于GCC11比GCC8更快,因此GCC8-&GCC11的发展前景看好
  • GCC12和GCC13都比GCC11慢得多,其中GCC13最差(比GCC11慢一倍,比GCC12差得多)

Table of Results:

Runtimes are in milliseconds (ms)个个

enter image description here

Graph (NOTE: Logarithmic Y axis!!):

Runtimes are in milliseconds (ms)个个

enter image description here

代码

Cython file:

###############################################################################

import numpy as np
cimport numpy as np
import cython

from cython.view cimport array as cvarray

from libc.float cimport FLT_MAX

DTYPE_float = np.float32
ctypedef np.float32_t DTYPE_float_t

cdef float SMALL = 1e-10
cdef int MAXSIZE = 1000000

cdef extern from "math.h" nogil:
    cdef float fabsf(float x)

###############################################################################

@cython.final
cdef class CythonLoops:

    cdef int m, n
    cdef int [:] col_starts
    cdef int [:] row_indices
    cdef double [:] x

    def __cinit__(self):

        self.m = 0
        self.n = 0

        self.col_starts  = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
        self.row_indices = cvarray(shape=(MAXSIZE,), itemsize=sizeof(int), format='i')
        self.x = cvarray(shape=(MAXSIZE,), itemsize=sizeof(double), format='d')

    @cython.boundscheck(False) # turn off bounds-checking for entire function
    @cython.wraparound(False)  # turn off negative index wrapping for entire function
    @cython.nonecheck(False)
    @cython.initializedcheck(False)
    @cython.cdivision(True)
    cpdef run(self, DTYPE_float_t[::1, :] matrix,
                    DTYPE_float_t[:]      ub_values,
                    DTYPE_float_t[:]      priority):

        cdef Py_ssize_t i, j, m, n
        cdef int nza, collen
        cdef double too_large, ok, obj
        cdef float ub, element

        cdef int [:] col_starts = self.col_starts
        cdef int [:] row_indices = self.row_indices
        cdef double [:] x = self.x

        m = matrix.shape[0]
        n = matrix.shape[1]

        self.m = m
        self.n = n

        nza = 0
        collen = 0
        for i in range(n):
            for j in range(m+1):
                if j == 0:
                    element = priority[i]
                else:
                    element = matrix[j-1, i]

                if fabsf(element) < SMALL:
                    continue

                if j == 0:
                    obj = <double>element
                    # Do action 1 with external library
                else:
                    collen = nza + 1
                    col_starts[collen] = i+1
                    row_indices[collen] = j
                    x[collen] = <double>element
                    nza += 1

            ub = ub_values[i]

            if ub > FLT_MAX:
                too_large = 0.0
                # Do action 2 with external library
            elif ub > SMALL:
                ok = <double>ub
                # Do action 3 with external library

        # Use x, row_indices and col_starts in the external library

Setup file:

我使用以下代码来编译它:

python setup.py build_ext --inplace --compiler=mingw32 gcc13

其中最后一个参数是我要测试的编译器

#!/usr/bin/env python

from setuptools import setup
from setuptools import Extension

from Cython.Build import cythonize
from Cython.Distutils import build_ext

import numpy as np
import os
import shutil
import sys
import getpass


MODULE = 'loop_cython_%s'

GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables']
GCC_EXTRA_LINK_ARGS = ['-flto'] + GCC_EXTRA_COMPILE_ARGS

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL']
MSVC_EXTRA_LINK_ARGS = MSVC_EXTRA_COMPILE_ARGS


def remove_builds(kind):

    for folder in ['build', 'bin']:
        if os.path.isdir(folder):
            if folder == 'bin':
                continue
            shutil.rmtree(folder, ignore_errors=True)

    if os.path.isfile(MODULE + '_%s.c'%kind):
        os.remove(MODULE + '_%s.c'%kind)


def setenv(extra_args, doset=True, path=None, kind='gcc8'):

    flags = ''
    if doset:
        flags = '  '.join(extra_args)

    for key in ['CFLAGS', 'FFLAGS', 'CPPFLAGS']:
        os.environ[key] = flags

    user = getpass.getuser()

    if doset:
        path = os.environ['PATH']
        if kind == 'gcc8':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_8.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc11':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc12':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_12.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'gcc13':
            os.environ['PATH'] = r'C:\Users\%s\Tools\MinGW64_13.2.0\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\%s\WinPython39\WPy64-39100\python-3.9.10.amd64;'%(user, user)
        elif kind == 'msvc':
            os.environ['PATH'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\bin\Hostx64\x64;C:\WINDOWS\system32;C:\WINDOWS;C:\Users\J0514162\WinPython39\WPy64-39100\python-3.9.10.amd64;C:\Program Files (x86)\Windows Kits\10\bin\10.0.22000.0\x64'
            os.environ['LIB'] = r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\lib\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\um\x64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\ucrt\x64'
            os.environ["DISTUTILS_USE_SDK"] = '1'
            os.environ["MSSdk"] = '1'

    else:
        os.environ['PATH'] = path

    return path

class CustomBuildExt(build_ext):
    def build_extensions(self):
        # Override the compiler executables. Importantly, this
        # removes the "default" compiler flags that would
        # otherwise get passed on to to the compiler, i.e.,
        # distutils.sysconfig.get_var("CFLAGS").
        self.compiler.set_executable("compiler_so", "gcc -mdll -O -Wall -DMS_WIN64")
        self.compiler.set_executable("compiler_cxx", "g++ -O -Wall -DMS_WIN64")
        self.compiler.set_executable("linker_so", "gcc -shared -static")
        self.compiler.dll_libraries = []
        build_ext.build_extensions(self)

if __name__ == '__main__':

    os.system('cls')

    kind = None
    for arg in sys.argv:
        if arg.strip() in ['gcc8', 'gcc11', 'gcc12', 'gcc13', 'msvc']:
            kind = arg
            sys.argv.remove(arg)
            break

    base_file = os.path.join(os.getcwd(), MODULE[0:-3])


    source = base_file + '.pyx'
    target = base_file + '_%s.pyx'%kind
    shutil.copyfile(source, target)

    if kind == 'msvc':
        extra_compile_args = MSVC_EXTRA_COMPILE_ARGS[:]
        extra_link_args = MSVC_EXTRA_LINK_ARGS[:] + ['/MANIFEST']
    else:
        extra_compile_args = GCC_EXTRA_COMPILE_ARGS[:]
        extra_link_args = GCC_EXTRA_LINK_ARGS[:]

    path = setenv(extra_compile_args, kind=kind)
    remove_builds(kind)

    define_macros = [('WIN32', 1)]

    nname = MODULE%kind

    include_dirs = [np.get_include()]
    if kind == 'msvc':
        include_dirs += [r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\ucrt',
                         r'C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.35.32215\include',
                         r'C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\shared']

    extensions = [
        Extension(nname, [nname + '.pyx'],
                  extra_compile_args=extra_compile_args,
                  extra_link_args=extra_link_args,
                  include_dirs=include_dirs,
                  define_macros=define_macros)]

    # build the core extension(s)
    setup_kwargs = {'ext_modules': cythonize(extensions,
                                             compiler_directives={'embedsignature'  : False,
                                                                  'boundscheck'     : False,
                                                                  'wraparound'      : False,
                                                                  'initializedcheck': False,
                                                                  'cdivision'       : True,
                                                                  'language_level'  : '3str',
                                                                  'nonecheck'       : False},
                                             force=True,
                                             cache=False,
                                             quiet=False)}

    if kind != 'msvc':
        setup_kwargs['cmdclass'] = {'build_ext': CustomBuildExt}

    setup(**setup_kwargs)

    setenv([], False, path)
    remove_builds(kind)

Test code:

import os
import numpy
import time

import loop_cython_msvc as msvc
import loop_cython_gcc8 as gcc8
import loop_cython_gcc11 as gcc11
import loop_cython_gcc12 as gcc12
import loop_cython_gcc13 as gcc13

              # M     N     NNZ(matrix) NNZ(priority) NNZ(ub)
DIMENSIONS = [(1661 , 2608 , 3560      , 375         , 2488 ),
              (2828 , 3512 , 4333      , 413         , 2973 ),
              (780  , 985  , 646       , 23          , 984  ),
              (799  , 1558 , 1883      , 301         , 1116 ),
              (399  , 540  , 388       , 44          , 517  ),
              (10545, 10486, 14799     , 1053        , 10041),
              (3369 , 3684 , 3684      , 256         , 3242 ),
              (2052 , 5513 , 4772      , 1269        , 3319 ),
              (224  , 628  , 1345      , 396         , 594  ),
              (553  , 1475 , 1315      , 231         , 705  )]


def RunTest():

    print('M      N      NNZ     MSVC     GCC 8    GCC 11   GCC 12   GCC 13')

    for m, n, nnz_mat, nnz_priority, nnz_ub in DIMENSIONS:
        print('%-6d %-6d %-8d'%(m, n, nnz_mat), end='')

        for solver, label in zip([msvc, gcc8, gcc11, gcc12, gcc13], ['MSVC', 'GCC 8', 'GCC 11', 'GCC 12', 'GCC 13']):

            numpy.random.seed(123456)
            size = m*n
            idxes = numpy.arange(size)
            matrix = numpy.zeros((size, ), dtype=numpy.float32)
            idx_mat = numpy.random.choice(idxes, nnz_mat)
            matrix[idx_mat] = numpy.random.uniform(0, 1000, size=(nnz_mat, ))

            matrix = numpy.asfortranarray(matrix.reshape((m, n)))

            idxes = numpy.arange(m)
            priority = numpy.zeros((m, ), dtype=numpy.float32)
            idx_pri = numpy.random.choice(idxes, nnz_priority)
            priority[idx_pri] = numpy.random.uniform(0, 1000, size=(nnz_priority, ))

            idxes = numpy.arange(n)
            ub_values = numpy.inf*numpy.ones((n, ), dtype=numpy.float32)
            idx_ub = numpy.random.choice(idxes, nnz_ub)
            ub_values[idx_ub] = numpy.random.uniform(0, 1000, size=(nnz_ub, ))

            solver = solver.CythonLoops()
            time_tot = []

            for i in range(20):

                start = time.perf_counter()
                solver.run(matrix, ub_values, priority)
                elapsed = time.perf_counter() - start
                time_tot.append(elapsed*1e3)

            print('%-8.4g'%numpy.mean(time_tot), end=' ')

        print()

if __name__ == '__main__':

    os.system('cls')
    RunTest()

EDIT

在@PeterCordes comments 之后,我将优化标志更改为:

MSVC_EXTRA_COMPILE_ARGS = ['/O2', '/GS-', '/fp:fast', '/Ob2', '/nologo', '/arch:AVX512', '/Ot', '/GL', '/QIntel-jcc-erratum']
GCC_EXTRA_COMPILE_ARGS = ['-Ofast', '-funroll-loops', '-flto', '-ftree-vectorize', '-march=native', '-fno-asynchronous-unwind-tables', '-Wa,-mbranches-within-32B-boundaries']

MSVC似乎比以前略快(在5%到10%之间),但GCC12和GCC13比以前慢(在3%到20%之间).下面是一个带有最大2D矩阵结果的图表:

Note:"当前"表示@PeterCordes建议的最新优化标志,"之前"是原始标志集.

enter image description here

推荐答案

这是一个部分的答案,提供了由Cython生成的C代码,一旦它被稍微简化为更短、更易读、更容易编译,而不需要任何Cython/Python/Numpy依赖(转换不会对计时产生重大影响).它还显示了生成的汇编代码和很少的注释.

以下是C代码:

#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <float.h>

/* Remove external dependencies */
typedef int64_t Py_ssize_t;
typedef void memoryview_obj;
typedef void* VTable;

typedef struct {
  struct memoryview_obj *memview;
  char *data;
  Py_ssize_t shape[8];
  Py_ssize_t strides[8];
  Py_ssize_t suboffsets[8];
} memviewslice;

struct Self
{
    /* PyObject_HEAD */
    struct VTable* tab;
    int m;
    int n;
    memviewslice col_starts;
    memviewslice row_indices;
    memviewslice x;
};

static float test_SMALL = 1e-10;

extern void INC_MEMVIEW(memviewslice* slice, int count);
extern void XDEC_MEMVIEW(memviewslice* slice, int count);

void run(struct Self* self, memviewslice matrix, memviewslice ub_values, memviewslice priority)
{
    Py_ssize_t i;
    Py_ssize_t j;
    Py_ssize_t m;
    Py_ssize_t n;
    int nza;
    int collen;
    double too_large;
    double ok;
    double obj;
    float ub;
    float element;
    memviewslice col_starts = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice row_indices = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice x = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice t_1 = { 0, 0, { 0 }, { 0 }, { 0 } };
    memviewslice t_2 = { 0, 0, { 0 }, { 0 }, { 0 } };
    Py_ssize_t t_3;
    Py_ssize_t t_4;
    Py_ssize_t t_6;
    Py_ssize_t t_7;
    int t_9;
    Py_ssize_t t_10;
    Py_ssize_t t_11;

    t_1 = self->col_starts;
    INC_MEMVIEW(&t_1, 1);
    col_starts = t_1;
    t_1.memview = NULL;
    t_1.data = NULL;

    t_1 = self->row_indices;
    INC_MEMVIEW(&t_1, 1);
    row_indices = t_1;
    t_1.memview = NULL;
    t_1.data = NULL;

    t_2 = self->x;
    INC_MEMVIEW(&t_2, 1);
    x = t_2;
    t_2.memview = NULL;
    t_2.data = NULL;

    m = matrix.shape[0];
    n = matrix.shape[1];
    self->m = m;
    self->n = n;
    nza = 0;
    collen = 0;

    t_3 = n;
    t_4 = t_3;
    for (i = 0; i < t_4; i++)
    {
        t_6 = m + 1;
        t_7 = t_6;

        for (j = 0; j < t_7; j++)
        {
            t_9 = (j == 0) != 0;

            if (t_9)
            {
                t_10 = i;
                element = (*((float*) ( /* dim=0 */ (priority.data + t_10 * priority.strides[0]) )));

                goto L7;
            }

            {
                t_10 = j - 1;
                t_11 = i;
                element = (*((float*) ( /* dim=1 */ (( /* dim=0 */ ((char *) (((float*) matrix.data) + t_10)) ) + t_11 * matrix.strides[1]) )));
            }

            L7:;

            t_9 = (fabsf(element) < test_SMALL) != 0;
            if (t_9)
            {
                goto L5_continue;
            }

            t_9 = ((j == 0) != 0);
            if (t_9)
            {

                obj = (double)element;

                goto L9;
            }

            {
                collen = nza + 1;

                t_11 = collen;
                *((int *) ( /* dim=0 */ (col_starts.data + t_11 * col_starts.strides[0]) )) = i + 1;

                t_11 = collen;
                *((int *) ( /* dim=0 */ (row_indices.data + t_11 * row_indices.strides[0]) )) = j;

                t_11 = collen;
                *((double *) ( /* dim=0 */ (x.data + t_11 * x.strides[0]) )) = ((double)element);

                nza = nza + 1;
            }

            L9:;
            L5_continue:;
        }

        t_11 = i;
        ub = (*((float*) ( /* dim=0 */ (ub_values.data + t_11 * ub_values.strides[0]) )));

        t_9 = (ub > FLT_MAX) != 0;
        if (t_9)
        {
            too_large = 0.0;
            goto L10;
        }

        t_9 = (ub > test_SMALL) != 0;
        if (t_9)
        {
            ok = (double)ub;
        }

        L10:;
    }

    XDEC_MEMVIEW(&col_starts, 1);
    XDEC_MEMVIEW(&row_indices, 1);
    XDEC_MEMVIEW(&x, 1);
}

下面是使用GCC 11.4(带有编译器选项-Ofast -funroll-loops -ftree-vectorize -march=skylake -fno-asynchronous-unwind-tables)在汇编中编译的结果主循环:

.L4:
        lea     rdi, [r12-1]
        xor     eax, eax
        and     edi, 3
        je      .L5
        vmovss  xmm4, DWORD PTR [rcx]
        mov     eax, 1
        vandps  xmm5, xmm4, xmm1
        vcomiss xmm9, xmm5
        ja      .L25
        inc     edx
        movsx   r14, edx
        mov     r15, r9
        imul    r15, r14
        vcvtss2sd       xmm6, xmm4, xmm4
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, r14
        imul    r14, rbp
        mov     DWORD PTR [r10+r15], 1
        vmovsd  QWORD PTR [rbx+r14], xmm6
.L25:
        cmp     rdi, 1
        je      .L5
        cmp     rdi, 2
        je      .L26
        inc     rax
        vmovss  xmm7, DWORD PTR [rcx-4+rax*4]
        vandps  xmm2, xmm7, xmm1
        vcomiss xmm9, xmm2
        ja      .L26
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm3, xmm7, xmm7
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm3
.L26:
        inc     rax
        vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
        vandps  xmm8, xmm6, xmm1
        vcomiss xmm9, xmm8
        jbe     .L36
        jmp     .L5
.L7:
        vmovss  xmm10, DWORD PTR [rcx-4+rax*4]
        vandps  xmm11, xmm10, xmm1
        vcomiss xmm9, xmm11
        ja      .L24
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm12, xmm10, xmm10
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm12
.L24:
        lea     r14, [rax+1]
        vmovss  xmm13, DWORD PTR [rcx-4+r14*4]
        vandps  xmm14, xmm13, xmm1
        vcomiss xmm9, xmm14
        ja      .L27
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm15, xmm13, xmm13
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm15
.L27:
        lea     r14, [rax+2]
        vmovss  xmm0, DWORD PTR [rcx-4+r14*4]
        vandps  xmm4, xmm0, xmm1
        vcomiss xmm9, xmm4
        ja      .L28
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm5, xmm0, xmm0
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm5
.L28:
        add     rax, 3
        vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
        vandps  xmm7, xmm6, xmm1
        vcomiss xmm9, xmm7
        ja      .L5
.L36:
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm2, xmm6, xmm6
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm2
.L5:
        inc     rax
        cmp     r12, rax
        jne     .L7
        lea     rax, [rsi+1]
        add     rcx, QWORD PTR [rsp+8]
        cmp     r13, rsi
        je      .L2
        mov     rsi, rax
        jmp     .L4
.L2:

GCC也是如此:

.L4:
        lea     rdi, [r12-1]
        xor     eax, eax
        and     edi, 3
        je      .L5
        vmovss  xmm15, DWORD PTR [rcx]
        mov     eax, 1
        vandps  xmm4, xmm15, xmm1
        vcomiss xmm0, xmm4
        ja      .L27
        inc     edx
        movsx   r14, edx
        mov     r15, r9
        imul    r15, r14
        vcvtss2sd       xmm5, xmm15, xmm15
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, r14
        imul    r14, rbp
        mov     DWORD PTR [r10+r15], 1
        vmovsd  QWORD PTR [rbx+r14], xmm5
.L27:
        cmp     rdi, 1
        je      .L5
        cmp     rdi, 2
        je      .L28
        inc     rax
        vmovss  xmm6, DWORD PTR [rcx-4+rax*4]
        vandps  xmm7, xmm6, xmm1
        vcomiss xmm0, xmm7
        ja      .L28
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm2, xmm6, xmm6
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm2
.L28:
        inc     rax
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm3, xmm5, xmm1
        vcomiss xmm0, xmm3
        jbe     .L39
        jmp     .L5
.L41:
        vmovss  xmm8, DWORD PTR [rcx-4+rax*4]
        vandps  xmm9, xmm8, xmm1
        vcomiss xmm0, xmm9
        ja      .L26
        inc     edx
        movsx   r15, edx
        mov     r14, r9
        mov     rdi, r11
        imul    r14, r15
        imul    rdi, r15
        imul    r15, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm10, xmm8, xmm8
        mov     DWORD PTR [r10+rdi], eax
        vmovsd  QWORD PTR [rbx+r15], xmm10
.L26:
        lea     r14, [rax+1]
        vmovss  xmm11, DWORD PTR [rcx-4+r14*4]
        vandps  xmm12, xmm11, xmm1
        vcomiss xmm0, xmm12
        ja      .L29
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm13, xmm11, xmm11
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm13
.L29:
        lea     r14, [rax+2]
        vmovss  xmm14, DWORD PTR [rcx-4+r14*4]
        vandps  xmm15, xmm14, xmm1
        vcomiss xmm0, xmm15
        ja      .L30
        inc     edx
        movsx   rdi, edx
        mov     r15, r9
        imul    r15, rdi
        vcvtss2sd       xmm4, xmm14, xmm14
        mov     DWORD PTR [r8+r15], esi
        mov     r15, r11
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r10+r15], r14d
        vmovsd  QWORD PTR [rbx+rdi], xmm4
.L30:
        add     rax, 3
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm6, xmm5, xmm1
        vcomiss xmm0, xmm6
        ja      .L5
.L39:
        inc     edx
        movsx   rdi, edx
        mov     r14, r9
        mov     r15, r11
        imul    r14, rdi
        imul    r15, rdi
        imul    rdi, rbp
        mov     DWORD PTR [r8+r14], esi
        vcvtss2sd       xmm7, xmm5, xmm5
        mov     DWORD PTR [r10+r15], eax
        vmovsd  QWORD PTR [rbx+rdi], xmm7
.L5:
        inc     rax
        cmp     r12, rax
        jne     .L41
        mov     rdi, QWORD PTR [rsp+8]
        lea     rax, [rsi+1]
        add     rcx, rdi
        cmp     rsi, r13
        je      .L2
        mov     rsi, rax
        jmp     .L4
.L2:

GCC 13.1也是如此(龙珠上没有GCC 13.2):

.L4:
        mov     QWORD PTR [rsp+16], r14
        xor     eax, eax
.L20:
        mov     r8, rax
        sub     r8, r9
        not     r8
        and     r8d, 3
        je      .L6
        inc     rax
        vmovss  xmm15, DWORD PTR [rcx-4+rax*4]
        vandps  xmm0, xmm15, xmm2
        vcomiss xmm1, xmm0
        ja      .L20
        inc     edx
        mov     r15, r10
        mov     rdi, QWORD PTR [rsp+24]
        vcvtss2sd       xmm4, xmm15, xmm15
        movsx   r14, edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        vmovsd  QWORD PTR [r12+r14], xmm4
        cmp     r8, 1
        je      .L6
        cmp     r8, 2
        je      .L21
        inc     rax
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm6, xmm5, xmm2
        vcomiss xmm1, xmm6
        ja      .L20
        inc     edx
        mov     r14, r10
        vcvtss2sd       xmm7, xmm5, xmm5
        movsx   r8, edx
        imul    r14, r8
        mov     DWORD PTR [rdi+r14], esi
        mov     rdi, rbx
        imul    rdi, r8
        imul    r8, r13
        mov     DWORD PTR [r11+rdi], eax
        vmovsd  QWORD PTR [r12+r8], xmm7
.L21:
        inc     rax
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vandps  xmm3, xmm5, xmm2
        vcomiss xmm1, xmm3
        ja      .L20
        inc     edx
        mov     r14, r10
        mov     rdi, QWORD PTR [rsp+24]
        movsx   r8, edx
        imul    r14, r8
.L32:
        mov     r15, rbx
        mov     DWORD PTR [rdi+r14], esi
        vcvtss2sd       xmm8, xmm5, xmm5
        imul    r15, r8
        imul    r8, r13
        mov     DWORD PTR [r11+r15], eax
        vmovsd  QWORD PTR [r12+r8], xmm8
.L6:
        lea     r8, [rax+1]
        mov     rax, r8
        cmp     r9, r8
        je      .L7
        vmovss  xmm9, DWORD PTR [rcx-4+r8*4]
        vandps  xmm10, xmm9, xmm2
        vcomiss xmm1, xmm10
        ja      .L20
        inc     edx
        mov     r15, r10
        mov     rdi, QWORD PTR [rsp+24]
        vcvtss2sd       xmm11, xmm9, xmm9
        movsx   r14, edx
        mov     DWORD PTR [rsp+4], edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        inc     rax
        vmovss  xmm12, DWORD PTR [rcx-4+rax*4]
        vmovsd  QWORD PTR [r12+r14], xmm11
        vandps  xmm13, xmm12, xmm2
        vcomiss xmm1, xmm13
        ja      .L20
        inc     edx
        mov     r15, r10
        vcvtss2sd       xmm14, xmm12, xmm12
        movsx   r14, edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        lea     rax, [r8+2]
        vmovss  xmm15, DWORD PTR [rcx-4+rax*4]
        vmovsd  QWORD PTR [r12+r14], xmm14
        vandps  xmm0, xmm15, xmm2
        vcomiss xmm1, xmm0
        ja      .L20
        mov     edx, DWORD PTR [rsp+4]
        mov     r15, r10
        vcvtss2sd       xmm4, xmm15, xmm15
        add     edx, 2
        movsx   r14, edx
        imul    r15, r14
        mov     DWORD PTR [rdi+r15], esi
        mov     r15, rbx
        imul    r15, r14
        imul    r14, r13
        mov     DWORD PTR [r11+r15], eax
        lea     rax, [r8+3]
        vmovss  xmm5, DWORD PTR [rcx-4+rax*4]
        vmovsd  QWORD PTR [r12+r14], xmm4
        vandps  xmm6, xmm5, xmm2
        vcomiss xmm1, xmm6
        ja      .L20
        mov     edx, DWORD PTR [rsp+4]
        mov     r14, r10
        add     edx, 3
        movsx   r8, edx
        imul    r14, r8
        jmp     .L32
.L7:
        mov     rdi, QWORD PTR [rsp+8]
        mov     r14, QWORD PTR [rsp+16]
        lea     rax, [rsi+1]
        add     rcx, rdi
        cmp     r14, rsi
        je      .L2
        mov     rsi, rax
        jmp     .L4
.L2:

讨论

首先,我们可以看到,主循环中充斥着混合了goto的条件分支.这类代码通常不是人类编写的.编译器可以预料到它们是稀有的,因此优化它们的效率低于没有goto的(因为断言更容易计算).问题是,j == 0条件在这里并不真正有用:它只能在基于j的循环之前执行一次.GCC似乎看不到这一点(特别是由于Cythongoto指令放大了复杂的控制流).除此之外,编译器不知道fabsf(element) < SMALL是否经常是真的还是经常是假的,或者甚至在运行时很难预测.如果循环很容易预测(前两种情况),那么一个好的编译器可以编写比GCC生成的代码更高效的代码(可能使用SIMD指令).

实际上,由于复杂的控制流,我甚至不确定-funroll-loops在这里是否有用.它使热环明显变大(即.高速缓存中的更多空间),并且处理器可能花费更多时间来预测被采用/未被采用分支的概率.这取决于目标处理器.拥有较小的代码也有利于分析和优化您的代码:更简单的代码更容易优化.

为了优化代码,编译器对循环的成本和要采用的分支的概率进行假设.例如,IF可以假设内部循环IF比外部循环要昂贵得多.启发式算法被用来优化代码,并仔细调整它们的参数,以最大限度地提高基准测试套件的性能(并避免像GCC的Linux内核那样的代码中的回归).因此,副作用是使用较新版本的GCC或大多数编译器时,您的代码可能会更慢,而大多数代码可能会更快.这也是给定编译器的性能从一个版本到另一个版本不是很稳定的原因之一.例如,如果一个矩阵有很多零,那么fabsf(element) < SMALL通常是真的,这会导致外循环比预期的更昂贵.问题是,在GCC 13.2中,外部循环可能没有GCC 12.2中的优化得那么好,从而导致GCC 13.2中的执行时间较慢.例如,GCC 13.2以以下命令开始外环:

        mov     r8, rax
        sub     r8, r9
        not     r8
        and     r8d, 3
        je      .L6

请注意,像Skylake这样的现代处理器可以在一个周期内并行执行多条指令.由于r8寄存器上的相依性链,此指令序列只能按顺序执行.与此同时,GCC 12.2在外环开始时产生了这样的结果:

        lea     rdi, [r12-1]
        xor     eax, eax
        and     edi, 3
        je      .L5

在这种情况下,依赖链(在rdi/edi上)明显要小得多.在实践中,汇编代码相当大,所以我不认为这是唯一的问题.您需要跟踪GCC所有版本中输入矩阵的实际路径(非常繁琐),或者只运行代码并使用VTune(本例中推荐的解决方案)等工具对其进行分析.

我们可以看到,有很多imul个说明.这类指令通常是由于在编译时未知的数组步长造成的.GCC没有为步长为1的代码生成代码.如果可能,您应该在内存视图中使用::1(在注释中由@ead提到),因为imul条指令在这样的热循环中往往有点昂贵(更不用说它们还会增加代码的大小,因为部分展开已经更大了).

Python相关问答推荐

当密钥是复合且唯一时,Pandas合并抱怨标签不唯一

在函数内部使用eval(),将函数的输入作为字符串的一部分

根据不同列的值在收件箱中移动数据

将数据框架与导入的Excel文件一起使用

Python库:可选地支持numpy类型,而不依赖于numpy

将pandas Dataframe转换为3D numpy矩阵

删除字符串中第一次出现单词后的所有内容

SQLAlchemy bindparam在mssql上失败(但在mysql上工作)

无论输入分辨率如何,稳定扩散管道始终输出512 * 512张图像

如何使用两个关键函数来排序一个多索引框架?

ruamel.yaml dump:如何阻止map标量值被移动到一个新的缩进行?

处理Gekko的非最优解

为什么t sns.barplot图例不显示所有值?'

计算机找不到已安装的库'

将数字数组添加到Pandas DataFrame的单元格依赖于初始化

如果服务器设置为不侦听创建,则QWebSocket客户端不连接到QWebSocketServer;如果服务器稍后开始侦听,则不连接

函数()参数';代码';必须是代码而不是字符串

具有不同坐标的tkinter canvs.cocords()和canvs.moveto()

如何在Django查询集中生成带有值列表的带注释的字段?

将时间序列附加到数据帧