Use a list comprehension and convert each list to a NumPy array to improve performance:
df['a_avg'] = [(np.round(np.array(p) - np.average(p), 3)).tolist() for p in df['A']]
Or, with apply:
df['a_avg'] = df.A.apply(lambda p: (np.round(np.array(p) - np.average(p), 3)).tolist())
print(df)
                      A                           a_avg
0  [4.2, 2.3, 6.5, 2.3]  [0.375, -1.525, 2.675, -1.525]
1  [4.1, 5.3, 6.5, 3.8]  [-0.825, 0.375, 1.575, -1.125]
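Because the comprehension handles one list at a time, it should also work when the lists differ in length. A minimal self-contained sketch, assuming made-up sample data and a hypothetical helper named `center` (neither comes from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the two lists deliberately differ in length.
df = pd.DataFrame({'A': [[4.2, 2.3, 6.5, 2.3], [1.0, 2.0, 3.0]]})


def center(values, decimals=3):
    """Subtract the list's own mean from each element, then round."""
    arr = np.asarray(values, dtype=float)
    return np.round(arr - arr.mean(), decimals).tolist()


# One centered list per row, each using only that row's mean.
df['a_avg'] = [center(p) for p in df['A']]
print(df)
```

Wrapping the expression in a small function is only a readability choice here; the inline comprehension above does the same work.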
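If every list has the same length, a vectorized solution also works. Take the mean with axis=1 so that it is computed per row; np.average(arr) without an axis averages over all values in the array and gives different results:
arr = np.array(df.A.tolist())
df['a_avg'] = np.round(arr - arr.mean(axis=1, keepdims=True), 3).tolist()
print(df)
                      A                           a_avg
0  [4.2, 2.3, 6.5, 2.3]  [0.375, -1.525, 2.675, -1.525]
1  [4.1, 5.3, 6.5, 3.8]  [-0.825, 0.375, 1.575, -1.125]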
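To center each row on its own mean in the vectorized form, the mean has to be taken along axis=1 and kept as a column so it broadcasts across the row. A short sketch that cross-checks the result against the list comprehension (the variable names `row_means` and `expected` are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [[4.2, 2.3, 6.5, 2.3], [4.1, 5.3, 6.5, 3.8]]})

# Stack the equal-length lists into a 2-D array of shape (n_rows, list_len).
arr = np.array(df['A'].tolist())

# axis=1 averages within each row; keepdims=True keeps shape (n_rows, 1)
# so the subtraction broadcasts across every element of that row.
row_means = arr.mean(axis=1, keepdims=True)
df['a_avg'] = np.round(arr - row_means, 3).tolist()

# Cross-check against the per-row list comprehension.
expected = [np.round(np.array(p) - np.average(p), 3).tolist() for p in df['A']]
assert np.allclose(df['a_avg'].tolist(), expected)
```

This only works when np.array can build a rectangular 2-D array, which is why the lists must all have the same length.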
Testing performance:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[[4.2,2.3,6.5,2.3],[4.1,5.3,6.5,3.8]]})
# 2k rows
df = pd.concat([df] * 1000, ignore_index=True)
#Timeless solution
In [36]: %timeit df["a_avg"] = [[round(e - np.average(lst), 3) for e in lst] for lst in df["A"]]
118 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [37]: %timeit df['a_avg'] = [(np.round(np.array(p) - np.average(p), 3)).tolist() for p in df['A']]
37.1 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [38]: %timeit df['a_avg'] = df.A.apply(lambda p: (np.round(np.array(p) - np.average(p), 3)).tolist())
37.7 ms ± 446 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [39]: %%timeit
...: arr = np.array(df.A.tolist())
...: df['a_avg'] = np.round(arr - np.average(arr), 3).tolist()
...:
1.36 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
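Testing with 20k rows:
# 20k rows (rebuilt from the original 2-row frame, not from the 2k-row one)
df = pd.DataFrame({'A':[[4.2,2.3,6.5,2.3],[4.1,5.3,6.5,3.8]]})
df = pd.concat([df] * 10000, ignore_index=True)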
In [41]: %timeit df["a_avg"] = [[round(e - np.average(lst), 3) for e in lst] for lst in df["A"]]
1.18 s ± 5.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [42]: %timeit df['a_avg'] = [(np.round(np.array(p) - np.average(p), 3)).tolist() for p in df['A']]
366 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [43]: %timeit df['a_avg'] = df.A.apply(lambda p: (np.round(np.array(p) - np.average(p), 3)).tolist())
364 ms ± 824 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %%timeit
...: arr = np.array(df.A.tolist())
...: df['a_avg'] = np.round(arr - np.average(arr), 3).tolist()
...:
13.7 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
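To repeat the comparison outside IPython, something like the following sketch with the standard-library timeit module could be used. The function names `per_row` and `vectorized` are hypothetical, the vectorized variant here uses the per-row mean (axis=1), and the numbers it prints will depend on the machine and library versions:

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'A': [[4.2, 2.3, 6.5, 2.3], [4.1, 5.3, 6.5, 3.8]]})
df = pd.concat([base] * 1000, ignore_index=True)  # 2k rows


def per_row(frame):
    """List comprehension: one small NumPy array per row."""
    return [np.round(np.array(p) - np.average(p), 3).tolist() for p in frame['A']]


def vectorized(frame):
    """Single 2-D array; requires every list to have the same length."""
    arr = np.array(frame['A'].tolist())
    return np.round(arr - arr.mean(axis=1, keepdims=True), 3).tolist()


# Best of 3 repeats, 3 calls each, reported as seconds per call.
timings = {
    name: min(timeit.repeat(lambda f=func: f(df), number=3, repeat=3)) / 3
    for name, func in [('per_row', per_row), ('vectorized', vectorized)]
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```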