Yours is not a minimal example, so let's first see what happens with a minimal one and then test performance on the full example.
Data
import pandas as pd
import numpy as np
from scipy.stats import norm
# Number of simulations
trials = 10000
# Generate random variables
df1 = pd.DataFrame(norm.rvs(size = (500, trials)))
Min example
Here I both reduce the amount of data and change your function so that it needs less data.
# first 10 rows, first 3 columns; .copy() so later assignments don't touch df1
df_min = df1.iloc[:10, :3].copy()
# backup
df_min_bk = df_min.copy()
f_min = lambda x: np.sum(x > 0) > 2
where df_min is
0 1 2
0 0.407418 1.741455 -0.270929
1 -0.530294 1.248405 1.201781
2 -1.193793 -0.088235 0.991222
3 -0.941380 0.499053 -0.913778
4 0.951970 -2.073895 -1.179818
5 -1.666666 1.143326 1.266971
6 0.688032 -0.188798 -0.130474
7 0.618970 -0.595450 1.420563
8 1.370329 -0.904624 1.167164
9 -0.571588 0.547064 -1.169145
Run minimal example
Using apply
%%time
for col in df_min.columns:
    df_min[col] = df_min[col].rolling(window=3).apply(f_min)
CPU times: user 10.6 ms, sys: 693 µs, total: 11.3 ms
Wall time: 11 ms
The output is
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 0.0 0.0 0.0
6 0.0 0.0 0.0
7 0.0 0.0 0.0
8 1.0 0.0 0.0
9 0.0 0.0 0.0
Avoiding apply
Set df_min = df_min_bk.copy() and rewrite the same function using built-in methods:
%%time
for col in df_min.columns:
    df_min[col] = df_min[col].gt(0).rolling(window=3).sum().gt(2).astype(int)
CPU times: user 1.02 ms, sys: 3.21 ms, total: 4.24 ms
Wall time: 3.9 ms
This is almost 3x faster than the previous case. The output is still
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 1 0 0
9 0 0 0
This is acceptable as long as we remember that the first n-1 rows of the rolling window should be NaN (here they come out as 0).
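If the NaN rows matter downstream, they can be put back. The following is only a sketch under my own assumptions: the seeded random frame stands in for df_min, and `raw=True` plus `float(...)` in the apply call are my additions to keep the reference output explicitly 0.0/1.0.

```python
import numpy as np
import pandas as pd

# Small random frame standing in for df_min (the seed is only for reproducibility).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 3)))

window = 3

# Reference: rolling apply, which leaves NaN in the first window-1 rows.
reference = df.rolling(window=window).apply(
    lambda x: float(np.sum(x > 0) > 2), raw=True
)

# Vectorised version: count positives per window, then compare with the threshold.
counts = df.gt(0).rolling(window=window).sum()
flags = counts.gt(2).astype(float)

# Restore NaN wherever the window was still incomplete.
flags_with_nan = flags.where(counts.notna())

print(flags_with_nan.equals(reference))
```

`DataFrame.equals` treats NaN values in the same position as equal, so it is a convenient check that the two approaches agree row by row.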
Avoid looping over columns
Set df_min = df_min_bk.copy() again; we can use the previous expression without looping over the columns:
%%time
df_min = df_min.gt(0).rolling(window=3).sum().gt(2).astype(int)
CPU times: user 2.21 ms, sys: 0 ns, total: 2.21 ms
Wall time: 2.22 ms
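This is almost 2x faster than the previous case and about 5x faster than the apply case. The output is the same as in the previous example.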
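`%%time` only works inside IPython/Jupyter. Outside a notebook, the three variants could be compared with the standard `timeit` module, roughly as sketched below. The function names (`with_apply`, `per_column`, `whole_frame`) and the seeded test frame are hypothetical, and the numbers will differ from the ones quoted above depending on machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_small = pd.DataFrame(rng.standard_normal((10, 3)))


def with_apply(df):
    # Column loop + rolling apply (raw=True and float() are my choices here).
    out = df.copy()
    for col in out.columns:
        out[col] = out[col].rolling(window=3).apply(
            lambda x: float(np.sum(x > 0) > 2), raw=True
        )
    return out


def per_column(df):
    # Column loop with built-in methods only.
    out = df.copy()
    for col in out.columns:
        out[col] = out[col].gt(0).rolling(window=3).sum().gt(2).astype(int)
    return out


def whole_frame(df):
    # Same expression applied to the whole frame at once.
    return df.gt(0).rolling(window=3).sum().gt(2).astype(int)


timings = {}
for func in (with_apply, per_column, whole_frame):
    # Best of 5 repeats of 20 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(df_small), number=20, repeat=5)) / 20
    timings[func.__name__] = best
    print(f"{func.__name__:12s} {best * 1e3:.3f} ms per call")
```

Taking the minimum over several repeats reduces noise from other processes, which matters a lot for timings in the millisecond range.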
Full Example
%%time
df2 = df2.gt(0).rolling(window=30).sum().gt(20).astype(int)
CPU times: user 607 ms, sys: 27 ms, total: 634 ms
Wall time: 633 ms
This takes less than a second, whereas using apply and looping over the columns takes minutes.
Timing for apply
CPU times: user 8min 40s, sys: 150 ms, total: 8min 40s
Wall time: 8min 40s
Compared with the previous method, the speedup is about 820x.
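Since the apply version takes minutes on the full data, one way to gain confidence in the fast version is to run apply on a handful of columns only and compare. This is a sketch, not the original code: `rolling_positive_flag` is a hypothetical helper, the frame is deliberately smaller than 500 x 10000, and the apply lambda assumes the original function is "more than 20 positives in a window of 30", as the vectorised expression suggests.

```python
import numpy as np
import pandas as pd


def rolling_positive_flag(df, window, min_count):
    """Return 1 where more than `min_count` of the last `window` values are positive, else 0."""
    return df.gt(0).rolling(window=window).sum().gt(min_count).astype(int)


rng = np.random.default_rng(0)
# Smaller than the real data so that the apply check stays quick.
df_full = pd.DataFrame(rng.standard_normal((500, 200)))

fast = rolling_positive_flag(df_full, window=30, min_count=20)

# Spot check: run the slow apply version on a few columns only.
sample_cols = [0, 50, 199]
slow = df_full[sample_cols].rolling(window=30).apply(
    lambda x: float(np.sum(x > 0) > 20), raw=True
)

# apply leaves NaN in the first 29 rows; treat them as 0 for the comparison.
matches = (slow.fillna(0).astype(int) == fast[sample_cols]).all().all()
print("spot check passed:", matches)
```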
Conclusion
Play first with a small amount of data that you can visualize, then with a few full columns, and finally with all the data.