C++ 标准矩阵乘法算法的运行时间比按比例放大时的预期慢(2^10+ x 2^10+ 元素)

发布于03月23日

我有一个标准的矩阵乘法算法.

// matrix.h

typedef struct Matrix {
    int rows;
    int columns;
    double *elements;
} Matrix;

double *addressAt(Matrix *matrix, int row, int column);
double *rowAt(Matrix *matrix, int row);
double dotProductColumnArray(Matrix *matrix, const int column, double *array);
void mmProduct(Matrix *first, Matrix *second, Matrix *result);

// matrix.c

// Helper function to find address of elements at row and column.
inline double *addressAt(Matrix *matrix, int row, int column) {
    return &(matrix->elements[row * matrix->columns + column]);
}

// Helper function to find row array
inline double *rowAt(Matrix *matrix, int row) {
    return &(matrix->elements[row * matrix->columns]);
}

// Finds the dot product of a column of a matrix and an array.
double dotProductColumnArray(Matrix *matrix, const int column, double *array) {
    double sum = 0.0;
    const int rows = matrix->rows, columns = matrix->columns;
    int j = column;
    const double *elements = matrix->elements;
    for (int i = 0; i < rows; i++) {
        sum += array[i] * elements[j];
        j += columns;
    }
    return sum;
}

void mmProduct(Matrix *first, Matrix *second, Matrix *result) {
    const int rows = result->rows;
    const int columns = result->columns;
    for (int i = 0; i < rows; i++) {
        double *row = rowAt(first, i);
        for (int j = 0; j < columns; j++) {
            *addressAt(result, i, j) = dotProductColumnArray(second, j, row);
        }
    }
}

// Snippet of main.c

// Fills the matrix with random numbers between -1 and 1
void randomFill(Matrix *matrix);

int main() {
    struct timeval timestamp;
    long start, now;
    Matrix first, second, result;
    
    // 2^10
    first = createMatrix((int)pow(2, 10), (int)pow(2, 10));
    second = createMatrix((int)pow(2, 10), (int)pow(2, 10));
    randomFill(&first);
    randomFill(&second);
    result = createMatrix((int)pow(2, 10), (int)pow(2, 10));

    gettimeofday(&timestamp, NULL);
    start = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    mmProduct(&first, &second, &result);
    gettimeofday(&timestamp, NULL);
    now = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    printf("Time taken: %ldms\n", now - start);
    
    deleteMatrix(&first);
    deleteMatrix(&second);
    deleteMatrix(&result);
    
    // Same code as above but for 2^11, 2^12, replacing pow() when necessary.
    // ...
    // ...
}

以下是您自己运行它的完整代码: https://gist.github.com/kevinMEH/c37bda728bf5ee53c32a405ef611e5a7个

当我用两个2¹⁰x 2¹⁰矩阵运行该算法时，乘法大约需要2秒.

当我使用两个2¹¹x2¹¹矩阵运行该算法时，我预计该算法会在8*2秒~16秒内将它们相乘.然而，它需要36秒才能进行倍增.

当我使用两个2¹²x2¹²矩阵运行该算法时，我预计该算法将它们相乘所需的时间为64*2000毫秒~128秒.最坏的情况是，我预计8*36=288秒会成倍增加.然而，它需要391秒才能倍增.

(使用Apple Clang 14.0.0和-O1编译，但使用-Ofast并且没有优化，它的伸缩性类似，仍然不是8的倍数，但更高.运行在Apple M1芯片上.)

我很困惑为什么会这样.执行的加法和乘法运算的次数正好是(2ⁿ)³，当我将矩阵的宽度增加2倍时，我预计后续的乘法运算每次只需要2=8倍的时间.

嫌疑人:

addressAt和rowAt帮助器函数是导致速度减慢的原因:这没有意义；加法和乘法的数量按n³zoom ，但调用addressAt的时候仅按n²zoom ，而rowAt的调用仅按n²zoom ，因此我们实际上应该看到速度的相对提高.(8*(n³+n²+n)&gt；8n³+4n²+2n) 我还手动内联了函数，发现没有明显的速度提高，所以应该不是因为函数调用开销.

非正态数:我的随机数生成器得到一个随机的U64，并将其除以2*0x8..0u，得到一个介于0和1之间的数字，因此应该不存在非正态数.尽管如此，在随机填充函数中，我将所有内容乘以100000，结果仍然是相同的.

我怀疑这就是罪魁祸首，但我找不到原因. 我最好的猜测是，速度变慢是因为CPU预取了错误的列元素.当我们按列数递增j时，CPU对此感到困惑，不会预取matrix->elements[j + n * columns]处的数组元素，因为它不希望j按列递增(所以它预取j、j + 1或其他任何值，但不是j + columns)，但这对我来说没有意义，因为columns是常量，所以CPU应该知道得更清楚.

Side note:

我还编写了实现Strassen matrix multiplication algorithm的mmStrassen()，当我测试它时，它每次都出乎意料地长7倍.它是在常规的mmProduct函数之后一起测试的，所以它不是运行时间问题.

另外，欢迎任何优化建议，谢谢.

对于这个长长的问题，我要提前表示感谢和歉意.

// Original: for(int i = 0; i < rows; i++) { double* row = &(first->elements[i * firstColumns]); for(int j = 0; j < columns; j++) { for(int k = 0; k < firstColumns; k++) { /* The result[i][j] memory address remains constant, which is not desirable because no SIMD, all operations performed sequentially on a single result[i][j] memory address The elements in the row array used one after the other, which is good. The elements in the column array are spaced out by the value of columns, which is bad. */ result->elements[i * columns + j] += row[k] * second->elements[k * columns + j]; } } } // Loop swapped: for(int i = 0; i < rows; i++) { double* row = &(first->elements[i * firstColumns]); for(int k = 0; k < firstColumns; k++) { for(int j = 0; j < columns; j++) { /* The result[i][j] memory address is used once, and then subsequent result[i][j + n] addresses are used which enables SIMD The row element is constant, which is very good! The elements in the column array are used one after the other, which is good. */ result->elements[i * columns + j] += row[k] * second->elements[k * columns + j]; } } }

void mmProduct(Matrix*__restrict first, Matrix*__restrict second, Matrix*__restrict result) { const int rows = result->rows; const int columns = result->columns; const int size = rows * columns; const int firstColumns = first->columns; for(int i = 0; i < size; i++) { result->elements[i] = 0.0; } for(int i = 0; i < rows; i++) { double* row = &(first->elements[i * firstColumns]); for(int k = 0; k < firstColumns; k++) { for(int j = 0; j < columns; j++) { result->elements[i * columns + j] += row[k] * second->elements[k * columns + j]; } } } } void mmProductWT(Matrix*__restrict first, Matrix*__restrict second, Matrix*__restrict result) { const int rows = result->rows; const int columns = result->columns; const int operandColumns = first->columns; double* secondTransposeElements = (double*) calloc(columns * second->rows, sizeof(double)); Matrix secondTranspose = { columns, second->rows, secondTransposeElements }; transpose(second, &secondTranspose); for(int i = 0; i < rows; i++) { for(int j = 0; j < columns; j++) { double sum = 0.0; for(int k = 0; k < operandColumns; k++) { result->elements[i * columns + j] += first->elements[i * operandColumns + k] * secondTranspose.elements[j * operandColumns + k]; } result->elements[i * columns + j] = sum; } } free(secondTransposeElements); }

C++ 标准矩阵乘法算法的运行时间比按比例放大时的预期慢(2^10+ x 2^10+ 元素)

嫌疑人:

Side note:

推荐答案

Problem

Swapping the k and j loops:

Rewritten functions:

C++相关问答推荐

为什么海湾合作委员会在共享对象中的. init_data的虚拟内存地址之前留出一个空白

如果我释放其他内容，返回值就会出错

当我运行/调试C程序时，Malloc()似乎正在将&q；r\r...&q；赋值给一个指针，我不确定为什么？

struct 上的OpenMP缩减

向上强制转换C中的数值类型总是可逆的吗？

如何在C++中处理按键

是否可以通过调用两个函数来初始化2D数组？示例：ARRAY[STARTING_ROWS()][STARTING_COLUMNS()]

每次除以或乘以整数都会得到0.0000

为什么函数是按照定义的顺序执行的，而不是按照从avr-c中的int main()调用的顺序执行的？

在txt文件中找到指定的字符串，并从数字中减go 相同的值

在C++中允许使用字符作为宏参数

是否定义了此函数的行为？

在吉陀罗中，_2_1_和CONCAT11是什么意思？

向左移位3如何得到以字节为单位的位数？

OMP并行嵌套循环

为什么GCC 13没有显示正确的二进制表示法？

变量值不正确的问题

传递参数：C 和 C++ 中 array 与 *&array 和 &array[0] 的区别

在 C 中传递参数时出现整数溢出

全局变量 y0 与 mathlib 冲突，无法编译最小的 C 代码