C++ OpenMP 基本温度建模，并行处理时间比串行代码长

发布于07月14日

我是OpenMP新手，我必须完成这个项目，在这个项目中，我需要使用雅可比迭代法求解2D矩阵，以使用OpenMP解决导热性问题.

基本上，它是一个在侧面有四个壁的板，有固定的温度，我需要计算出中间在中间.

代码已经交给了我们，我要做的是三件简单的事情:

对串行代码计时
并行串行代码并比较
如果可能，进一步优化并行代码

我已经运行了串行代码，并将代码并行化以进行比较.

我将try 对两者进行编译器优化，但我希望并行代码更快.

有趣的是，我添加的线程越多，速度就越慢.

我理解对于一个小的问题规模来说，这将是一个更大的开销，但我认为这是一个足够大的问题规模？

这是在任何重大优化之前，我是否做了一些明显错误的事情，使其成为这样？

代码如下:

序列号:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{

    int m; 
    int n;
    double tol;// = 0.0001;

    int i, j, iter;

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    tol = atof(argv[3]);

    /**
     * @var t, tnew, diff, diffmax,
     * t is the old temprature array,tnew is the new array
     */
    double t[m+2][n+2], tnew[m+1][n+1], diff, difmax;

    /**
     * Timer variables
     * @var start, end 
     */
    double start, end;

    printf("%d %d %lf\n",m,n, tol);

    start = omp_get_wtime();

    // initialise temperature array
    for (i=0; i <= m+1; i++) {
        for (j=0; j <= n+1; j++) {
            t[i][j] = 30.0;
        }
    }

    // fix boundary conditions
    for (i=1; i <= m; i++) {
        t[i][0] = 33.0;
        t[i][n+1] = 42.0;
    }
    for (j=1; j <= n; j++) {
        t[0][j] = 20.0;
        t[m+1][j] = 25.0;
    }

    // main loop
    iter = 0;
    difmax = 1000000.0;
    while (difmax > tol) {

        iter++;

        // update temperature for next iteration
        for (i=1; i <= m; i++) {
            for (j=1; j <= n; j++) {
                tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
            }
        }

        // work out maximum difference between old and new temperatures
        difmax = 0.0;
        for (i=1; i <= m; i++) {
            for (j=1; j <= n; j++) {
                diff = fabs(tnew[i][j]-t[i][j]);
                if (diff > difmax) {
                    difmax = diff;
                }
                // copy new to old temperatures
                t[i][j] = tnew[i][j];
            }
        }

    }

    end = omp_get_wtime();

    // print results,
    //Loop tempratures commented out to save performance
    printf("iter = %d  difmax = %9.11lf\n", iter, difmax);
    printf("Time in seconds: %lf \n", end - start);
    // for (i=0; i <= m+1; i++) {
    //  printf("\n");
    //  for (j=0; j <= n+1; j++) {
    //      printf("%3.5lf ", t[i][j]);
    //  }
    // }
    // printf("\n");

}

以下是并行代码:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>



int main(int argc, char *argv[])
{
    int m; 
    int n;
    double tol;// = 0.0001;

    /**
     * @brief Integer variables
     * @var i external loop (y column array) counter,
     * @var j internal loop (x row array counter) counter,
     * @var iter number of iterations,
     * @var numthreads number of threads
     */
    int i, j, iter, numThreads;

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    tol = atof(argv[3]);
    numThreads = atoi(argv[4]);

    /**
     * @brief Double variables
     * @var t, tnew -> The variable that holds the temprature, the t is the old value and the tnew is the new value,
     * @var diff Measures the difference,
     * @var diffmax 
     * t is the temprature array, I guess it holds the matrix?
     * 
     */
    double t[m+2][n+2], tnew[m+1][n+1], diff, diffmax, privDiffmax;

    /**
     * Timer variables
     * @var start, end 
     */
    double start, end;

    /**
     * @brief Print the problem size & the tolerance
     * This print statement can be there as it is not part of the parallel region
     * We also print the number of threads when printing the problem size & tolerance
     */
    //printf("%d %d %lf %d\n",m,n, tol, numThreads);
    omp_set_num_threads(numThreads);

    /**
     * @brief Initialise the timer
     * 
     */
    start = omp_get_wtime();

    /**
     * @brief Creating the parallel region:
     * Here both loop counters are private:
     */
    #pragma omp parallel private(i, j)
    {
        /**
         * @brief initialise temperature array
         * This can be in a parallel region by itself
         */
        #pragma omp for collapse(2) schedule(static)
        for (i=0; i <= m+1; i++) {
            for (j=0; j <= n+1; j++) {
                t[i][j] = 30.0;
            }
        }

        // fix boundary conditions
        #pragma omp for schedule(static)
        for (i=1; i <= m; i++) {
            t[i][0] = 33.0;
            t[i][n+1] = 42.0;
        }

        #pragma omp for schedule(static)
        for (j=1; j <= n; j++) {
            t[0][j] = 20.0;
            t[m+1][j] = 25.0;
        }

    }   

    // main loop
    iter = 0;
    diffmax = 1000000.0;

    while (diffmax > tol) {

        iter = iter + 1;

        /**
         * @brief update temperature for next iteration
         * Here we have created a parallel for directive, this is the second parallel region
         */
        #pragma omp parallel for private(i, j) collapse(2) schedule(static)
        for (i=1; i <= m; i++) {
            for (j=1; j <= n; j++) {
                tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
            }
        }

        // work out maximum difference between old and new temperatures
        diffmax = 0.0;
        
        /**
         * @brief Third parallel region that compares the difference
         */
        #pragma omp parallel private(i, j, privDiffmax, diff)
        {
            privDiffmax = 0.0;
            #pragma omp for collapse(2) schedule(static)
            for (i=1; i <= m; i++) {
                for (j=1; j <= n; j++) {
                    diff = fabs(tnew[i][j]-t[i][j]);
                    if (diff > privDiffmax) {
                        privDiffmax = diff;
                    }
                    // copy new to old temperatures
                    t[i][j] = tnew[i][j];
                }
            }
            #pragma omp critical
            if (privDiffmax > diffmax)
            {
                diffmax = privDiffmax;
            }
        }
        

    }

    //Add timer for the end
    end = omp_get_wtime();

    // print results,
    //Loop tempratures commented out to save performance
    printf("iter = %d  diffmax = %9.11lf\n", iter, diffmax);
    printf("Time in seconds: %lf \n", end - start);
    // for (i=0; i <= m+1; i++) {
    //  printf("\n");
    //  for (j=0; j <= n+1; j++) {
    //      printf("%3.5lf ", t[i][j]);
    //  }
    // }
    // printf("\n");

}

以下是一些串行代码的基准测试:

我已经运行了代码并进行了测试，我已经注释掉了打印语句，因为除了测试之外，我不需要看到它.

我有一个8核的苹果Mac M1

我是OpenMP新手，忍不住觉得自己错过了什么.

.LBB0_27: # Parent Loop BB0_15 Depth=1 movsd xmm3, qword ptr [rbx + 8*rsi] # xmm3 = mem[0],zero movapd xmm5, xmm3 subsd xmm5, qword ptr [rdi + 8*rsi] andpd xmm5, xmm0 maxsd xmm5, xmm2 movsd qword ptr [rdi + 8*rsi], xmm3 add rsi, 1 movapd xmm2, xmm5 cmp r12, rsi jne .LBB0_27

.LBB3_4: mov rax, rcx cqo idiv r12 # <------------------- shl rax, 32 add rax, rdi sar rax, 32 mov rbp, rax imul rbp, r13 # <------------------- shl rdx, 32 add rdx, rdi sar rdx, 32 add rbp, rdx movsd xmm2, qword ptr [r9 + 8*rbp] # xmm2 = mem[0],zero imul rax, r8 # <------------------- add rax, rdx movapd xmm3, xmm2 subsd xmm3, qword ptr [rsi + 8*rax] andpd xmm3, xmm0 maxsd xmm3, xmm1 movsd qword ptr [rsi + 8*rax], xmm2 add rcx, 1 movapd xmm1, xmm3 cmp rbx, rcx jne .LBB3_4

关于性能的附加说明

为了获得更好的性能，consider using 100 or even 101.也可以考虑使用102.如果您不使用像NaN这样的特殊浮点(FP)值，并且不关心FP关联性，103也会有所帮助.不要每次都复制数组，并且使用double-buffering方法也有很大帮助(内存限制代码的扩展性不好).然后考虑阅读一篇研究论文以获得更好的性能(trapezoidal tiling可以用来进一步提高性能，但这是非常复杂的).还请注意，不使用collapse(2)可以减少并行量，这在具有大量内核的处理器上可能是一个问题，但实际上，在小数组上运行大量内核往往速度较慢(因为虚假共享和通信).

M1处理器的特别说明

M1处理器基于Big/Little architecture.由于少数几个运行速度快的"大"内核(但也会消耗大量空间和能量)，这种架构有助于加快顺序代码的速度.然而，高效运行并行核是困难的，因为"小"核(节能且小)比大的核慢得多，如果所有类型的核同时运行(IDK，如果M1默认情况下是这样).一种解决方案是控制执行，以便只使用相同类型的内核.另一种解决方案是使用动态调度，以便在运行时自动平衡工作(例如，使用第schedule(guided)条甚至schedule(dynamic)条).第二种解决方案往往会增加大量开销，并且已知会在(基于NUMA的)计算服务器(甚至是最近的AMD PC处理器)上引起其他棘手的问题.还需要注意的是，由于大小核之间的性能差异，扩展将与线程数不成线性关系.由于上述问题，目前许多应用程序对这种架构的支持效率很低.

C++ OpenMP 基本温度建模，并行处理时间比串行代码长

推荐答案

后果

关于性能的附加说明

M1处理器的特别说明

C++相关问答推荐

try 使用sigqueue函数将指向 struct 体的指针数据传递到信号处理程序，使用siginfo_t struct 体从一个进程传递到另一个进程

为什么在Linux(特别是Ubuntu 20.04LTS)上，POSIX共享内存对象在重启后仍然存在，然后突然变成了根用户？

我应该如何解决我自己为iOS编译的xmlsec1库的问题？转换Ctx.first在xmlSecTransformCtxPrepare()之后为空

如何跨平台处理UTF-16字符串？

将 struct 变量赋给自身(通过指针取消引用)是否定义了行为？

将宏值传递给ARM链接器，该链接器将变量放置在特定位置

Make Node函数.S有什么问题吗？

如何使用[BTStack]BLE发送大型(>；2kb)信息包

在基本OpenGL纹理四边形中的一个三角形中进行渲染

什么是.c.h文件？

如何在C中只对字符串(包含数字、单词等)中的数字进行重复操作？

我的C函数起作用了，但我不确定为什么

如何在GDB中查看MUSL的源代码

试图创建一个基本的Word克隆，但遇到了障碍

将数字的每一位数平方，并使用C将它们连接为一个数字(程序不能正确处理0)

从C中的函数返回静态字符串是不是一种糟糕的做法？

为什么会导致分段故障？(C语言中的一个程序，统计文件中某个单词的出现次数)

为什么我在我的代码中得到错误和退出代码-1073741819(0xC0000005)，但如果我添加了一个不相关的打印语句，它仍然有效？

使用 _Atomic float 时，MSVC 编译的代码会命中调试断言

我们可以在不违反标准的情况下向标准函数声明添加属性吗？