I am performing a row reduction on a 3-D ndarray (KxMxN): for each column, I take all of its values and apply a reduce function to produce a scalar; in the end, the KxMxN array becomes a 2-D ndarray of shape KxN.

The problem I want to solve is:

  1. Take a 3-D matrix and split it into 16 sub-matrices (or any even number of splits greater than 2; 16 is optimal for me);
  2. Generate all combinations of selecting half of the sub-matrices (8 out of 16) and concatenating those 8 sub-matrices into one matrix, so that each combination yields one matrix;
  3. For each column of each new matrix, across all combinations, compute a scalar value, preferably one scalar at a time.
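The three steps can be sketched at toy scale (the sizes here and the column-sum reducer are placeholders for illustration, not my real dimensions or reduce function):

```python
import itertools
import numpy as np

# Toy-sized sketch of the three steps (4 blocks, choose 2),
# using column sum as a placeholder reducer.
rng = np.random.default_rng(0)
m = rng.random((8, 3))          # K*M rows, N columns
mr = m.reshape(4, 2, 3)         # step 1: split into 4 sub-matrices
combos = list(itertools.combinations(range(4), 2))  # step 2: all "choose 2" index sets
results = np.array([
    mr[list(c)].reshape(-1, 3).sum(axis=0)          # step 3: reduce each column to a scalar
    for c in combos
])
print(results.shape)            # (6, 3): one row per combination, one scalar per column
```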

There are more implementation details, which I will explain along the way.

The 3-D ndarray contains floats.

In the example below, an njit-compiled function (numba) with numpy is the best I have achieved so far. I would like to know whether there is room for further improvement, from any angle.

cupy (GPU parallelism), dask (CPU parallelism) and numba's own parallelization have all failed to beat the version below (my use case is apparently too trivial to exploit the GPU's power, and I only have an 8 GB GPU). It is quite possible that these tools can be used in more advanced ways that I am not aware of.

from numba import njit, guvectorize, float64, int64
from math import sqrt
import numba as nb
import numpy as np
import itertools


# Create a 2D ndarray
m = np.random.rand(800,100)
# Reshape it into a list of sub-matrices
mr = m.reshape(16,50,100)

# Create an indices matrix from combinatorics 
# a typical one for me "select 8 from 16", 12870 combinations
# I do have a custom combination generator, but this is not what I wanted to optimise and itertools really has done a decent job already.
x = np.array( list(itertools.combinations(np.arange(16),8)) )

# Now we are going to select 8 sub-matrices from `mr` and reshape them to become one bigger sub-matrix; we do this in list comprehension.
# This is the matrix we are going to reduce.

# Bottleneck 1: This line takes the longest and I'd hope to improve on this line, but I am not sure there's much we could do here.

m3d = np.array([mr[idx_arr].reshape(400,100) for idx_arr in x])

# We create different versions of the same reduce function. 

# Bottleneck 2: The reduce function is another place I'd want to improve on.

# col - column values
# days - trading days in a year
# rf - risk free rate

# njit version with instance function `mean`, `std`, and python `sqrt`
@njit
def nb_sr(col, days, rf):
    mean = (col.mean() * days) - rf
    std = col.std() * sqrt(days)
    return mean / std

# njit version with numpy
@njit
def nb_sr_np(col, days, rf):
    mean = (np.mean(col) * days) - rf
    std = np.std(col) * np.sqrt(days)
    return mean / std

# guvectorize with numpy
@guvectorize([(float64[:],int64,float64,float64[:])], '(n),(),()->()', nopython=True)
def gu_sr_np(col,days,rf,res):
    mean = (np.mean(col) * days) - rf
    std = np.std(col) * np.sqrt(days)
    res[0] = mean / std

# We wrap them such that they can be applied on 2-D matrix with list comprehension.

# Bottleneck 3: I was thinking to probably vectorize this wrapper, but the closest I can get is list comprehension, which isn't really vectorization.

def nb_sr_wrapper(m2d):
    return [nb_sr(r, 252, .25) for r in m2d.T]

def nb_sr_np_wrapper(m2d):
    return [nb_sr_np(r, 252, .25) for r in m2d.T]

def gu_sr_np_wrapper(m2d):
    return [gu_sr_np(r, 252, .25) for r in m2d.T]

# Finally! here's our performance benchmarking step.

%timeit np.array( [nb_sr_wrapper(m) for m in m3d.T] )
# output: 4.26 s ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.array( [nb_sr_np_wrapper(m) for m in m3d.T] )
# output: 4.33 s ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.array( [gu_sr_np_wrapper(m) for m in m3d.T] )
# output: 6.06 s ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Accepted Answer

TL;DR: The final version below doesn't need m3d at all. Instead, it works directly from mr and x. This achieves an 88x speedup for steps 1, 2 and 3 combined (or a 54x speedup for steps 2 and 3 alone). The final solution runs in 78 ms.

Bottleneck 1 is relatively easy to fix:

a, b = x.shape
_, _, N = mr.shape

new_m3d = mr[x].reshape(a, -1, N)

>>> np.array_equal(m3d, new_m3d)
True

With the dimensions given (mr of shape (16, 50, 100) and x of shape (12870, 8)), this is about 3.5x faster.
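On toy-sized data (shapes made up for illustration), the equivalence of the fancy-indexing version and the original list comprehension can be seen directly:

```python
import numpy as np

# Toy-scale illustration: indexing mr with the whole (n_combos, k) index
# array at once gives shape (n_combos, k, M, N), and a single reshape
# flattens it to (n_combos, k*M, N).
mr_small = np.arange(4 * 2 * 3).reshape(4, 2, 3)
x_small = np.array([[0, 1], [2, 3], [1, 3]])
stacked = mr_small[x_small]                    # shape (3, 2, 2, 3)
flat = stacked.reshape(len(x_small), -1, 3)    # shape (3, 4, 3)
loop = np.array([mr_small[idx].reshape(4, 3) for idx in x_small])
print(np.array_equal(flat, loop))              # True
```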

For bottlenecks #2 and #3, the functionality can be expressed much more simply (below, out refers to the original result, np.array([nb_sr_wrapper(m) for m in m3d.T]), with days = 252 and rf = .25):

out2 = ((np.mean(m3d, axis=1) - rf/days) / np.std(m3d, axis=1)).T * sqrt(days)

>>> np.allclose(out, out2)
True

And timing-wise, this is already about 2x faster than np.array([nb_sr_wrapper(m) for m in m3d.T]).
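The rearrangement behind out2 is pure algebra: (mean*days - rf) / (std*sqrt(days)) equals ((mean - rf/days) / std) * sqrt(days). A quick numeric check (random data, same constants):

```python
import numpy as np

# Verify the algebraic rearrangement used by out2 on one random column.
rng = np.random.default_rng(3)
col = rng.random(400)
days, rf = 252, .25
lhs = (col.mean() * days - rf) / (col.std() * np.sqrt(days))
rhs = ((col.mean() - rf / days) / col.std()) * np.sqrt(days)
print(np.isclose(lhs, rhs))  # True
```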

Note that most of the time in the out2 computation is spent in np.std(m3d, axis=1). Let's try to speed that up as well:

def optstd(mr, x, ddof=0):
    n = mr.shape[1] * x.shape[1] - ddof
    m2 = np.sum(mr**2, axis=1) / n
    m = np.sum(mr, axis=1) / n
    return np.sqrt(m2[x].sum(1) - m[x].sum(1)**2)

out3 = ((np.mean(m3d, axis=1) - rf/days) / optstd(mr, x)).T * sqrt(days)

>>> np.allclose(out, out3)
True
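The identity behind optstd can be sanity-checked on small random data (sizes here are arbitrary): the std of a concatenation of blocks is sqrt(E[X^2] - E[X]^2), with both moments assembled from per-block sums alone:

```python
import itertools
import numpy as np

# Small-scale check of the identity used by optstd (ddof=0 case):
# std of the concatenated blocks == sqrt(E[X^2] - E[X]^2), where both
# moments are built from per-block sums only.
rng = np.random.default_rng(1)
mr_s = rng.random((4, 5, 3))
x_s = np.array(list(itertools.combinations(range(4), 2)))

n = mr_s.shape[1] * x_s.shape[1]          # elements per concatenated column
m2 = np.sum(mr_s**2, axis=1) / n          # per-block sum of squares / n
m1 = np.sum(mr_s, axis=1) / n             # per-block sum / n
fast_std = np.sqrt(m2[x_s].sum(1) - m1[x_s].sum(1)**2)

# Reference: materialize the concatenation and call np.std directly.
slow_std = np.std(mr_s[x_s].reshape(len(x_s), -1, mr_s.shape[2]), axis=1)
print(np.allclose(fast_std, slow_std))    # True
```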

Now we are getting some real speedup:

%timeit ((np.mean(m3d, axis=1) - rf/days) / optstd(mr, x)).T * sqrt(days)
358 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is roughly 12x faster than np.array([nb_sr_wrapper(m) for m in m3d.T]).

Final version

Finally, we notice that the code above still does repeated work (we could reuse m above to obtain np.mean()). So we can combine several of the elements above and eliminate m3d from this computation entirely:

def optcalc(mr, x, days, rf, ddof=0):
    n = mr.shape[1] * x.shape[1]
    m2 = np.sum(mr**2, axis=1)
    m = np.sum(mr, axis=1)
    A = m[x].sum(1) / n
    B = np.sqrt(m2[x].sum(1) / (n - ddof) - m[x].sum(1)**2 / (n - ddof)**2)
    return ((A - rf/days) / B).T * sqrt(days)

out4 = optcalc(mr, x, days, rf)

>>> np.allclose(out, out4)
True

%timeit optcalc(mr, x, days, rf)
# 78.1 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The timing is now 54x faster than the original out, and we no longer need m3d for it. So, counting the time of that first step as well, the overall speedup is 88.89x.
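As a self-contained sanity check on toy data (toy shapes; same days/rf constants as the question), optcalc can be compared against a straightforward per-column loop:

```python
import itertools
from math import sqrt
import numpy as np

def optcalc(mr, x, days, rf, ddof=0):
    # Final version from above: mean and std from per-block sums.
    n = mr.shape[1] * x.shape[1]
    m2 = np.sum(mr**2, axis=1)
    m = np.sum(mr, axis=1)
    A = m[x].sum(1) / n
    B = np.sqrt(m2[x].sum(1) / (n - ddof) - m[x].sum(1)**2 / (n - ddof)**2)
    return ((A - rf/days) / B).T * sqrt(days)

def sharpe(col, days, rf):
    # Per-column scalar, equivalent to the original nb_sr.
    return (col.mean() * days - rf) / (col.std() * sqrt(days))

rng = np.random.default_rng(2)
mr_s = rng.random((4, 5, 3))
x_s = np.array(list(itertools.combinations(range(4), 2)))
m3d_s = mr_s[x_s].reshape(len(x_s), -1, mr_s.shape[2])

# Reference result, column by column, combination by combination.
ref = np.array([[sharpe(r, 252, .25) for r in m.T] for m in m3d_s.T])
print(np.allclose(ref, optcalc(mr_s, x_s, 252, .25)))  # True
```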

Why is this faster?

  1. No huge Python loops.
  2. Multiplications and subtractions are moved as far "outside" as possible, to where the dimensionality has already been reduced.
  3. Instead of computing the std over every combination of columns, we operate on mr and compute the equivalent std of the concatenation.
  4. In the final proposal, optcalc does both mean and std from first principles.
