Using Python, I send 50 images to a CUDA routine that computes the positions of 50 beads in each image. Here is a diagram (I drew only 4 beads):

Image

The CUDA routine needs a flattened array, so I have to extract the regions of interest (say 64x64) from the image, flatten them, and then concatenate them.
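To make the target layout concrete, here is a minimal sketch with toy sizes. The 6x6 image, the 2x2 regions, and the names `roi`, `corners`, and `flat` are assumptions for illustration only, not part of the real pipeline.

```python
import numpy as np

# Toy sizes, assumed for illustration: a 6x6 "image" and 2x2 regions of interest.
roi = 2
image = np.arange(36, dtype=np.uint8).reshape(6, 6)

# Hypothetical top-left corners (row, column) of three regions.
corners = [(0, 0), (1, 3), (4, 2)]

# Extract each region, flatten it, and concatenate everything into one 1-D array.
flat = np.concatenate([image[r:r + roi, c:c + roi].ravel() for r, c in corners])

# Region i occupies flat[i * roi * roi : (i + 1) * roi * roi].
second = flat[1 * roi * roi : 2 * roi * roi].reshape(roi, roi)
print(flat.shape)
print(second)
```

With this layout, the consumer can recover region `i` from the flat array by slicing at a fixed stride of `roi * roi` elements.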

Here is what I did:

import numpy as np
import time
from numpy.lib.stride_tricks import sliding_window_view

# Create a numpy array of 1024 x 1280 (the image)
image = np.random.randint(0, 255, (1024, 1280), dtype=np.uint8)


# x_values and y_values hold the top-left corner coordinates of the 64 x 64 regions of interest.
x_values =  np.random.randint(100, 900, 50)
y_values =  np.random.randint(100, 900, 50)

# Create one flat output array per method; this is the array the CUDA routine consumes.
main_array_1 = np.zeros((50*50*64*64))
main_array_2 = np.zeros((50*50*64*64))


print("#########################")
# First method 
print("#########################")
start_time = time.time()
sub_images = np.array([image[x:x+64, y:y+64] for x, y in zip(x_values, y_values)])
flattened_sub_arrays = sub_images.ravel()
main_array_1[:len(flattened_sub_arrays)] = flattened_sub_arrays
print("--- %s seconds ---" % ((time.time() - start_time)))
print(flattened_sub_arrays.shape)


print("#########################")
# Second method (thanks to hpaulj, see comments)
print("#########################")
# Create an array of indices for x and y
x_indices = np.array([np.arange(x, x+64) for x in x_values])
y_indices = np.array([np.arange(y, y+64) for y in y_values])

start_time = time.time()
# Create a sliding window view of the image
window_view = np.lib.stride_tricks.sliding_window_view(image, (64, 64))
# Extract the sub-images using the x and y values
sub_images = window_view[x_values, y_values]
# Flatten the sub-images
flattened_sub_arrays_2 = sub_images.ravel()
main_array_2[:len(flattened_sub_arrays_2)] = flattened_sub_arrays_2
print("--- %s seconds ---" % ((time.time() - start_time)))
print(flattened_sub_arrays_2.shape)


print("#########################")
print("#########################")

# compare the two methods
print(np.array_equal(main_array_1, main_array_2))

Is there any way to make this faster?

Recommended answer

One possible way to speed up the function is to use numba and parallelize it:

from numba import njit, prange

@njit(parallel=True)
def store_array_numba(image, x_values, y_values, main_array):
    for i in prange(len(x_values)):
        x = x_values[i]
        y = y_values[i]
        idx = 64 * 64 * i
        for j in range(64):
            main_array[idx + 64 * j : idx + 64 * (j + 1)] = image[x + j, y : y + 64]
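The index arithmetic in the loop can be checked without numba. The sketch below mirrors the same row-by-row copy in plain Python and compares it with the `ravel()`-based version from the question. The helper name `store_array_loop`, the constant `ROI`, and the choice of 5 regions are assumptions for illustration; this version is not JIT-compiled and is not meant to be fast.

```python
import numpy as np

ROI = 64  # region size used in the answer


def store_array_loop(image, x_values, y_values, main_array):
    """Pure-Python mirror of the numba loop (hypothetical helper, no JIT)."""
    for i in range(len(x_values)):
        x, y = x_values[i], y_values[i]
        idx = ROI * ROI * i  # start of region i in the flat array
        for j in range(ROI):
            # Row j of region i lands at offset idx + ROI * j.
            main_array[idx + ROI * j : idx + ROI * (j + 1)] = image[x + j, y : y + ROI]
    return main_array


rng = np.random.default_rng(0)
image = rng.integers(0, 255, (1024, 1280), dtype=np.uint8)
x_values = rng.integers(100, 900, 5)
y_values = rng.integers(100, 900, 5)

out = store_array_loop(image, x_values, y_values, np.zeros(5 * ROI * ROI))
expected = np.array(
    [image[x : x + ROI, y : y + ROI] for x, y in zip(x_values, y_values)]
).ravel()
print(np.array_equal(out, expected))
```

Because each value of `i` writes to its own disjoint slice of `main_array`, the outer loop iterations do not depend on each other, which is what makes it a candidate for `prange`.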

Benchmark on my machine (AMD 5700x):

from timeit import timeit

import numpy as np
from numba import njit, prange

# Create a numpy array of 1024 x 1280
image = np.random.randint(0, 255, (1024, 1280), dtype=np.uint8)

N = 50
# Assuming x_values and y_values are lists of x and y coordinates (that define the top-left corner of each sub-image)
x_values = np.random.randint(100, 900, N)
y_values = np.random.randint(100, 900, N)

# This is the array that will store the sub-images and which will be used by cuda
main_array_1 = np.zeros((N * 64 * 64))
main_array_2 = np.zeros((N * 64 * 64))


def store_array(image, x_values, y_values, main_array):
    sub_images = np.array(
        [image[x : x + 64, y : y + 64] for x, y in zip(x_values, y_values)]
    )
    flattened_sub_arrays = sub_images.ravel()
    main_array[: len(flattened_sub_arrays)] = flattened_sub_arrays
    return main_array


@njit(parallel=True)
def store_array_numba(image, x_values, y_values, main_array):
    for i in prange(len(x_values)):
        x = x_values[i]
        y = y_values[i]
        idx = 64 * 64 * i
        for j in range(64):
            main_array[idx + 64 * j : idx + 64 * (j + 1)] = image[x + j, y : y + 64]


main_array_1 = store_array(image, x_values, y_values, main_array_1)
store_array_numba(image, x_values, y_values, main_array_2)
assert np.allclose(main_array_1, main_array_2)

t1 = timeit(
    "store_array(image, x_values, y_values, main_array_1)",
    number=50,
    globals=globals(),
)
t2 = timeit(
    "store_array_numba(image, x_values, y_values, main_array_2)",
    number=50,
    globals=globals(),
)
print(f"time normal = {t1}")
print(f"time numba  = {t2}")

Prints:

time normal = 0.0057696549920365214
time numba  = 0.001736361999064684
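Each total covers `number=50` calls, so the per-call cost and the speedup follow by simple arithmetic. The short sketch below only re-derives them from the two printed totals; the variable names are assumptions, and the figures apply to this one run on this one machine.

```python
# Totals printed by the benchmark above, each covering number=50 calls.
t_normal = 0.0057696549920365214
t_numba = 0.001736361999064684
calls = 50

# Convert each total to microseconds per call.
per_call_normal_us = t_normal / calls * 1e6
per_call_numba_us = t_numba / calls * 1e6

# Ratio of the two totals (the call count cancels out).
speedup = t_normal / t_numba

print(f"normal: {per_call_normal_us:.1f} us per call")
print(f"numba:  {per_call_numba_us:.1f} us per call")
print(f"speedup: {speedup:.2f}x")
```

That works out to roughly 115 microseconds versus 35 microseconds per call, a speedup of about 3.3x for N = 50 regions.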
