Python: alternative to a for loop over a subset of columns in Pandas

Posted on July 24

I have run into a run-time problem with Pandas.

The code looks like this:

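import pandas as pd

df = pd.DataFrame({"IDs": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "Month": ["01", "02", "01", "01", "02", "01", "01", "02", "01"],
                   "column1": [0.9, 0.5, 0.3, 0.8, 0.5, 0.1, 0.6, 0.2, 0.8]})

df_list = []
for id in df.IDs.unique():
    # Boolean mask: scans every row of df once per ID
    temp = df[df.IDs == id]
    # Average column1 within each month for this ID
    temp = temp.groupby("Month").mean()
    # Exponentially weighted running sum over this ID's monthly means
    temp2 = temp['column1'].ewm(span=3, adjust=True).sum()
    df_list.append(temp2)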

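Note that the unique IDs number about 500k, and the original DataFrame df contains about 6 million records.

Using tqdm to estimate the run time, the loop needs 14-15 hours to finish. Even if the loop body contains only the line temp = df[df.IDs == id], the estimate stays the same (the remaining steps are plain pandas functions, so they should not cause any performance problem). So the bottleneck is that line.

Is there another way to do this? Thanks for any suggestions.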

Recommended answer

out = (df.groupby(['IDs', 'Month'])
         .mean()['column1']
         .ewm(span=3, adjust=True)
         .sum())
print(out.reset_index())

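Output (here the exponentially weighted sum runs over the whole series, so it is not reset between IDs):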

   IDs Month  column1
0    1    01  0.60000
1    1    02  0.80000
2    2    01  0.85000
3    2    02  0.92500
4    3    01  1.16250
5    3    02  0.78125
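
To see where these numbers come from, here is a small sketch that recomputes them by hand. It assumes that span=3 corresponds to alpha = 2 / (span + 1) = 0.5 and that ewm(...).sum() with adjust=True is an unnormalised running sum in which each older value is discounted by (1 - alpha) per step. The helper ewm_sum and the hard-coded list of monthly means are illustrative only and are not part of the original answer.

```python
import pandas as pd

# Monthly means of column1 in (IDs, Month) order, as produced by
# df.groupby(['IDs', 'Month']).mean()['column1'] on the sample data.
means = [0.6, 0.5, 0.45, 0.5, 0.7, 0.2]

alpha = 2 / (3 + 1)  # span=3  ->  alpha = 2 / (span + 1) = 0.5


def ewm_sum(values, alpha):
    """Hypothetical helper: running sum where older values decay by (1 - alpha)."""
    out = []
    running = 0.0
    for x in values:
        running = x + (1 - alpha) * running
        out.append(running)
    return out


manual = ewm_sum(means, alpha)
pandas_result = pd.Series(means).ewm(span=3, adjust=True).sum().tolist()

print(manual)
print(pandas_result)
```

Because the running sum is never reset, the value for ID 2 in month 01 is 0.45 + 0.5 * 0.8 = 0.85, which includes a contribution from ID 1.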

out = (df.groupby('IDs')
         .apply(lambda x: x.groupby('Month')
                           .mean()['column1']
                           .ewm(span=3, adjust=True)
                           .sum())
         .stack())
print(out.reset_index(name='column1'))

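Output (here the exponentially weighted sum is computed separately for each ID, as in the loop from the question):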

   IDs Month  column1
0    1    01    0.600
1    1    02    0.800
2    2    01    0.450
3    2    02    0.725
4    3    01    0.700
5    3    02    0.550
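
If an explicit loop is still preferred, one possible variation is to iterate over df.groupby("IDs") instead of filtering with df[df.IDs == id]. groupby splits the frame once, whereas the boolean mask rescans all rows for every ID. The sketch below is only an illustration: the function name ewm_sum_per_id and the dict pieces are made up here, and it has not been timed on data of the size described in the question.

```python
import pandas as pd

df = pd.DataFrame({"IDs": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "Month": ["01", "02", "01", "01", "02", "01", "01", "02", "01"],
                   "column1": [0.9, 0.5, 0.3, 0.8, 0.5, 0.1, 0.6, 0.2, 0.8]})


def ewm_sum_per_id(frame):
    """Hypothetical helper: per-ID exponentially weighted sum of monthly means."""
    pieces = {}
    # groupby hands back one sub-frame per ID after a single split of the data
    for id_value, group in frame.groupby("IDs"):
        monthly_mean = group.groupby("Month")["column1"].mean()
        pieces[id_value] = monthly_mean.ewm(span=3, adjust=True).sum()
    # Combine the per-ID Series under an (IDs, Month) MultiIndex
    return pd.concat(pieces, names=["IDs", "Month"])


result = ewm_sum_per_id(df)
print(result.reset_index())
```

On the sample data this should give the same per-ID values as the second snippet above; result.reset_index() returns them as a flat frame with the columns IDs, Month and column1.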