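I'm looking for help speeding up a rolling calculation in pandas that computes a rolling mean using at most a predefined number of the most recent observations. Below is the code that generates a sample frame, followed by the frame itself: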

import pandas as pd
import numpy as np

tmp = pd.DataFrame(
    [
        [11.1]*3 + [12.1]*3 + [13.1]*3  + [14.1]*3 + [15.1]*3 + [16.1]*3 + [17.1]*3 + [18.1]*3,
        ['A', 'B', 'C']*8,
        [np.nan]*6 + [1, 1, 1] + [2, 2, 2] + [3, 3, 3] + [np.nan]*9
    ],
    index=['Date', 'Name', 'Val']
)
tmp = tmp.T.pivot(index='Date', columns='Name', values='Val')

Name    A    B    C
Date               
11.1  NaN  NaN  NaN
12.1  NaN  NaN  NaN
13.1    1    1    1
14.1    2    2    2
15.1    3    3    3
16.1  NaN  NaN  NaN
17.1  NaN  NaN  NaN
18.1  NaN  NaN  NaN

I would like to get this result:

Name    A    B    C
Date               
11.1  NaN  NaN  NaN
12.1  NaN  NaN  NaN
13.1  1.0  1.0  1.0
14.1  1.5  1.5  1.5
15.1  2.5  2.5  2.5
16.1  2.5  2.5  2.5
17.1  3.0  3.0  3.0
18.1  NaN  NaN  NaN

Attempted Solution

I tried the code below. It works, but it performs very poorly on the datasets I work with in practice.

tmp.rolling(window=3, min_periods=1).apply(lambda x: x[~np.isnan(x)][-2:].mean(), raw=True)

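Applying the above calculation to a 3k x 50k frame takes about 20 minutes. Is there a more elegant and faster way to get the same result? Perhaps by combining the results of several rolling calculations, or something with groupby?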
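
To make the per-window logic of the attempt above concrete, here is a minimal sketch of what the lambda computes for a single window. The helper name `mean_of_last_valid` and its `n` parameter are hypothetical and not part of the original post; the explicit empty check is an addition that only avoids the warning NumPy emits when taking the mean of an empty slice.

```python
import numpy as np

def mean_of_last_valid(window, n=2):
    """Mean of the last `n` non-NaN values of a 1-D window (NaN if there are none)."""
    valid = window[~np.isnan(window)]      # drop NaNs, keep the original order
    if valid.size == 0:
        return np.nan                      # nothing to average in this window
    return valid[-n:].mean()               # at most the n most recent valid values

print(mean_of_last_valid(np.array([1.0, 2.0, 3.0])))           # 2.5
print(mean_of_last_valid(np.array([3.0, np.nan, np.nan])))     # 3.0
print(mean_of_last_valid(np.array([np.nan, np.nan, np.nan])))  # nan
```

With `window=3` and `n=2` these three windows correspond to rows 15.1, 17.1 and 18.1 of the expected result.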

Versions

Python - 3.9.13, pandas - 2.0.3 and numpy - 1.25.2

Recommended answer

One idea is to use numba for a faster computation by passing engine='numba' to Rolling.apply:

(tmp.rolling(window=3, min_periods=1)
    .apply(lambda x: x[~np.isnan(x)][-2:].mean(), engine='numba', raw=True))

Test performance:

tmp = pd.concat([tmp] * 100000, ignore_index=True)

In [88]: %timeit tmp.rolling(window=3, min_periods=1).apply(lambda x: x[~np.isnan(x)][-2:].mean(),engine='numba', raw=True)
901 ms ± 6.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [89]: %timeit tmp.rolling(window=3, min_periods=1).apply(lambda x: x[~np.isnan(x)][-2:].mean(), raw=True)
13 s ± 181 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numpy approach:

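You can convert the DataFrame to a 3d array of windows after prepending rows of NaN values, then shift the non-NaN values to the end of each window and take the mean of the last N: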

#https://stackoverflow.com/a/44559180/2901002
def justify_nd(a, invalid_val, axis, side):    
    """
    Justify ndarray for the valid elements (that are not invalid_val).

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        invalid value
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. Must be 'front' or 'end'.
        So, with 'front', valid elements are pushed to the front and
        with 'end' valid elements are pushed to the end along specified axis.
    """
    
    pushax = lambda a: np.moveaxis(a, axis, -1)
    # Detect any float NaN, not only the np.nan singleton itself.
    if isinstance(invalid_val, float) and np.isnan(invalid_val):
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    
    if side=='front':
        justified_mask = np.flip(justified_mask,axis=axis)
            
    out = np.full(a.shape, invalid_val)
    if (axis==-1) or (axis==a.ndim-1):
        out[justified_mask] = a[mask]
    else:
        pushax(out)[pushax(justified_mask)] = pushax(a)[pushax(mask)]
    return out

from numpy.lib.stride_tricks import sliding_window_view as swv

window_size = 3
N = 2

a = tmp.astype(float).to_numpy()
arr = np.vstack([np.full((window_size-1,a.shape[1]), np.nan),a])

out = np.nanmean(justify_nd(swv(arr, window_size, axis=0), 
                            invalid_val=np.nan, axis=2, side='end')[:, :, -N:], 
                 axis=2)

print (out)
[[nan nan nan]
 [nan nan nan]
 [1.  1.  1. ]
 [1.5 1.5 1.5]
 [2.5 2.5 2.5]
 [2.5 2.5 2.5]
 [3.  3.  3. ]
 [nan nan nan]]

df = pd.DataFrame(out, index=tmp.index, columns=tmp.columns)
print (df)
Name    A    B    C
Date               
11.1  NaN  NaN  NaN
12.1  NaN  NaN  NaN
13.1  1.0  1.0  1.0
14.1  1.5  1.5  1.5
15.1  2.5  2.5  2.5
16.1  2.5  2.5  2.5
17.1  3.0  3.0  3.0
18.1  NaN  NaN  NaN
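
The padding and windowing step used above can be seen on a tiny array. This is only an illustrative sketch with made-up numbers (the array `a` below is not data from the question): prepending `window_size - 1` rows of NaN gives every original row a full-length window, and `sliding_window_view` with `axis=0` appends the window axis last.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

window_size = 3
a = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [np.nan, 40.0]])           # 4 rows, 2 columns (illustrative values)

# Prepend window_size - 1 rows of NaN so the first real row also gets a window.
pad = np.full((window_size - 1, a.shape[1]), np.nan)
arr = np.vstack([pad, a])                # shape (6, 2)

# One window per original row; the result is a read-only view, not a copy.
windows = sliding_window_view(arr, window_size, axis=0)
print(windows.shape)                     # (4, 2, 3): rows, columns, window
print(windows[0, 0])                     # [nan nan  1.]
print(windows[3, 1])                     # [nan 30. 40.]
```

This is why the answer justifies and slices along `axis=2`: that axis holds the values of each window.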

Performance:

tmp = pd.concat([tmp] * 100000, ignore_index=True)


In [99]: %%timeit
    ...: a = tmp.astype(float).to_numpy()
    ...: arr = np.vstack([np.full((window_size-1,a.shape[1]), np.nan),a])
    ...: 
    ...: out = np.nanmean(justify_nd(swv(arr, window_size, axis=0), 
    ...:                             invalid_val=np.nan, 
    ...:                             axis=2, side='end')[:, :, -N:], axis=2)
    ...: 
    ...: df = pd.DataFrame(out, index=tmp.index, columns=tmp.columns)
    ...: 

338 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
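
Before relying on a vectorized version for a large frame, it may be worth cross-checking it against the slow `rolling.apply` reference on random data. The sketch below is one way to do that. It does not use `justify_nd`: to stay self-contained it swaps in a stable `argsort` on the validity mask to push valid values to the end of each window, which is an alternative technique, not the answer's method. The function name `rolling_mean_last_n` and the random test frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def rolling_mean_last_n(df, window_size=3, n_last=2):
    """Mean of the last `n_last` non-NaN values in each trailing window (hypothetical helper)."""
    a = df.to_numpy(dtype=float)
    pad = np.full((window_size - 1, a.shape[1]), np.nan)
    windows = sliding_window_view(np.vstack([pad, a]), window_size, axis=0)
    # Stable sort of the "is valid" mask: NaNs first, valid values last, order kept.
    order = np.argsort(~np.isnan(windows), axis=-1, kind="stable")
    tail = np.take_along_axis(windows, order, axis=-1)[..., -n_last:]
    count = (~np.isnan(tail)).sum(axis=-1)
    total = np.nansum(tail, axis=-1)
    # Divide only where at least one valid value exists; otherwise return NaN.
    out = np.where(count > 0, total / np.maximum(count, 1), np.nan)
    return pd.DataFrame(out, index=df.index, columns=df.columns)

# Random frame with roughly half of the values missing.
rng = np.random.default_rng(0)
values = rng.normal(size=(200, 4))
values[rng.random(values.shape) < 0.5] = np.nan
df = pd.DataFrame(values, columns=list("ABCD"))

reference = df.rolling(window=3, min_periods=1).apply(
    lambda x: x[~np.isnan(x)][-2:].mean(), raw=True
)
fast = rolling_mean_last_n(df)
print(np.allclose(fast.to_numpy(), reference.to_numpy(), equal_nan=True))
```

Computing the mean as sum over count avoids the all-NaN warning that `np.nanmean` emits for empty windows.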
