I'm trying to parallelize a simple python program using numba. It consists of two functions. The first one is the simple power method

import numba
import numpy as np

@numba.jit(nopython=True)
def power_method(A, v):
    u = v.copy()
    for i in range(3 * 10**3):
        u = A @ u
        u /= np.linalg.norm(u)
    return u

The second function sweeps the vector v over a grid and runs power_method for each grid point.

@numba.jit(nopython=True, parallel=True)
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]

    results = np.empty((sz**3, n))
    tmp = np.linspace(-scale, scale, sz)
    
    for i1 in range(sz):
        v = np.empty(n, dtype=np.float64)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                v[0] = v1
                v[1] = v2
                v[2] = v3

                u = power_method(A, v)

                idx = i1 * sz**2 + i2 * sz + i3
                results[idx] = u.copy()
    
    return results

Then I run it with

n = 3
A = np.random.randn(n, n)
iterate_grid(A, 5.0, 20)

All iterations are independent. Moreover, the working set easily fits into cache, so I would expect that parallelizing the outer loop with prange would give approximately linear speed-up.
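For reference, the wall times below can be measured with a small harness like this (a hypothetical helper, not part of the original code; the warm-up call matters for numba-jitted functions because it keeps the one-off JIT compilation out of the measurement):

```python
import time

def bench(fn, *args, repeats=3):
    # Warm-up call: for numba-jitted functions this triggers (and thereby
    # excludes) the one-off JIT compilation before timing starts.
    fn(*args)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return min(times)

# Example with an ordinary function standing in for iterate_grid:
print(bench(sum, range(10**5)))
```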

However, the wall time for the sequential code is 6.07 s, while for the parallel code

@numba.jit(nopython=True, parallel=True)
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]

    results = np.empty((sz**3, n))
    tmp = np.linspace(-scale, scale, sz)
    
    for i1 in numba.prange(sz):
        v = np.empty(n, dtype=np.float64)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                v[0] = v1
                v[1] = v2
                v[2] = v3

                u = power_method(A, v)

                idx = i1 * sz**2 + i2 * sz + i3
                results[idx] = u.copy()
    
    return results

the wall time is 7.79 s. That is, parallelization actually slows the code down in this case.

Moreover, as I can see from iterate_grid.parallel_diagnostics(level=4), numba fuses tmp = np.linspace(-scale, scale, sz) with the for i1 in prange(sz): loop, which is incorrect, because I need all values of tmp in the inner loops.

Parallelizing the power method is not the task I'm actually trying to solve, it's just a small example on which numba does not behave as I expect. Can you explain to me this behavior of numba and advise me how I can parallelize iterations on the grid to get linear acceleration?

Thank you in advance for your help!

Accepted answer

Correctness prevails over performance, so you need to check that the results are actually correct before trying to measure the performance of an implementation. It turns out the results of the sequential and parallel implementations differ (you can use np.allclose to check that, or plot the difference between the arrays). Since the parallel code is identical except for the use of prange, a race condition is the likely cause. Running the program multiple times and checking whether the results change is a good way to detect a probable race condition (note that the absence of differences does not prove there is no race condition, and other problems can also cause non-deterministic results). A practical test on my machine gives very different results.
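The check itself might look like this (a minimal sketch with plain NumPy; the two arrays are synthetic stand-ins for the outputs of the sequential and parallel runs of iterate_grid):

```python
import numpy as np

# Synthetic stand-ins for the outputs of the two runs; in practice these
# would come from calling the sequential and parallel iterate_grid.
rng = np.random.default_rng(0)
seq = rng.standard_normal((20**3, 3))   # "sequential" results
par = seq.copy()
par[123] += 1e-3                        # simulate one row corrupted by a race

print(np.allclose(seq, par))            # False: the two runs disagree

# Locate exactly which grid points differ:
bad = np.where(~np.all(np.isclose(seq, par), axis=1))[0]
print(bad)                              # [123]
```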

The thing is, your code looks completely correct! Generally this means the problem is due to undefined behaviour or a compiler/runtime bug. A deeper analysis shows that the problem does not appear if tmp = np.linspace(-scale, scale, sz) is moved inside the parallel loop. This is almost certainly due to a bug in Numba.
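A sketch of that workaround, shown here without the numba decorators so it runs as plain Python (and with far fewer power iterations); in the real code the function keeps @numba.jit(nopython=True, parallel=True) and the outer range becomes numba.prange:

```python
import numpy as np

def power_method(A, v, iters=50):
    # Plain-Python stand-in for the jitted power_method (fewer iterations).
    u = v.copy()
    for _ in range(iters):
        u = A @ u
        u /= np.linalg.norm(u)
    return u

def iterate_grid_workaround(A, scale, sz):
    n = A.shape[0]
    results = np.empty((sz**3, n))
    # In the real code this loop uses numba.prange and the function carries
    # @numba.jit(nopython=True, parallel=True).
    for i1 in range(sz):
        # Workaround: create tmp INSIDE the parallel loop, so numba cannot
        # (incorrectly) fuse the linspace with the prange loop.
        tmp = np.linspace(-scale, scale, sz)
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                idx = i1 * sz**2 + i2 * sz + i3
                results[idx] = power_method(A, np.array([v1, v2, v3]))
    return results
```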

Assuming the results were correct, the performance would still be strongly dependent on the power_method function, which is not efficient. Indeed, neither Numba nor NumPy is optimized for very small arrays. Numba rarely optimizes out array allocations, except in some specific cases (like explicit allocations in a prange-based loop with detected invariants, as pointed out by @max9111): arrays are allocated on the heap using a fairly slow allocation method. On top of that, allocations do not scale, preventing any parallel speed-up. One optimization is to perform the operations manually using three scalar variables. Here is an optimized implementation:

@numba.njit
def fast_power_method(A, v1, v2, v3):
    # Unpacking
    # Note: there is no need for a copy here
    u1, u2, u3 = v1, v2, v3
    A11, A12, A13 = A[0]
    A21, A22, A23 = A[1]
    A31, A32, A33 = A[2]

    for i in range(3_000):
        # Optimized matrix multiplication
        t1 = u1 * A11 + u2 * A12 + u3 * A13
        t2 = u1 * A21 + u2 * A22 + u3 * A23
        t3 = u1 * A31 + u2 * A32 + u3 * A33

        # Renormalization
        # Note: multiplications are faster than divisions
        norm = np.sqrt(t1**2 + t2**2 + t3**2)
        inv_norm = 1.0 / norm
        u1 = t1 * inv_norm
        u2 = t2 * inv_norm
        u3 = t3 * inv_norm

    return u1, u2, u3

@numba.njit
def iterate_grid(A, scale, sz):
    assert A.shape[0] == A.shape[1] == 3
    n = A.shape[0]

    results = np.empty((sz**3, n))
    tmp = np.linspace(-scale, scale, sz)

    for i1 in range(sz):
        # Note: the temporary v array is no longer needed
        v1 = tmp[i1]
        for i2, v2 in enumerate(tmp):
            for i3, v3 in enumerate(tmp):
                u1, u2, u3 = fast_power_method(A, v1, v2, v3)

                idx = i1 * sz**2 + i2 * sz + i3
                results[idx, 0] = u1
                results[idx, 1] = u2
                results[idx, 2] = u3
    
    return results

This implementation takes 0.4 seconds, while the initial version takes about 10 seconds: it is 25 times faster while still being sequential. Results are slightly different due to floating-point rounding and non-associativity. Using prange with this version is a further 5 times faster on my 6-core machine, so it scales well, although the results are still wrong for the same reason as before. This shows that the allocations were indeed the bottleneck.
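Following the answer's own advice to check results first, the scalar rewrite can be cross-checked against the original array version (plain Python, no numba, and fewer iterations to keep it quick; only rounding-level differences are expected, e.g. because one version divides by the norm and the other multiplies by its reciprocal):

```python
import numpy as np

def power_method_array(A, v, iters=50):
    # The original array-based implementation, as plain Python.
    u = v.copy()
    for _ in range(iters):
        u = A @ u
        u /= np.linalg.norm(u)
    return u

def power_method_scalar(A, v1, v2, v3, iters=50):
    # Same arithmetic as fast_power_method, without the numba decorator.
    u1, u2, u3 = v1, v2, v3
    A11, A12, A13 = A[0]
    A21, A22, A23 = A[1]
    A31, A32, A33 = A[2]
    for _ in range(iters):
        t1 = u1 * A11 + u2 * A12 + u3 * A13
        t2 = u1 * A21 + u2 * A22 + u3 * A23
        t3 = u1 * A31 + u2 * A32 + u3 * A33
        inv_norm = 1.0 / np.sqrt(t1**2 + t2**2 + t3**2)
        u1, u2, u3 = t1 * inv_norm, t2 * inv_norm, t3 * inv_norm
    return u1, u2, u3

# A fixed symmetric matrix so the power iteration converges cleanly.
A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
v = np.array([1., 2., 3.])
ref = power_method_array(A, v)
opt = np.array(power_method_scalar(A, 1., 2., 3.))
print(np.allclose(ref, opt))   # True: only rounding-level differences
```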

The rule of thumb: always check the results first, and remember that micro-optimizations can yield a bigger speed-up than using multiple cores (especially for heavily numerical code).

UPDATE: I filed a Numba bug, available here.
