Having run a range of different examples, I found the second method in the code below to be about 75x faster on the example dataset:
import time
from random import randint

import numpy as np
import pandas as pd

data = [randint(90, 120) for _ in range(10000)]

# Method 1: for each row, scan forward until the price has moved 2 up or 2 down.
df1 = pd.DataFrame({'price': data})
t0 = time.time()
# object dtype so that string labels can be assigned without an upcast
df1['updown'] = pd.Series(np.nan, index=df1.index, dtype=object)
count = df1.shape[0]
for i in range(count):
    j = 1
    up = df1.price.iloc[i] + 2
    down = up - 4
    while (pos := i + j) < count:
        if (value := df1.price.iloc[pos]) >= up:
            df1.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df1.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1
t1 = time.time()
print(f'Method 1: {t1 - t0}')
res1 = df1.head()

# Method 2: compare the whole column against shifted copies of itself.
df2 = pd.DataFrame({'price': data})
t2 = time.time()
count = len(df2)
df2['updown'] = pd.Series(np.nan, index=df2.index, dtype=object)
up = df2.price + 2
down = df2.price - 2
# increase shift range until updown is set for all rows
# or there is insufficient data to change remaining rows
i = 1
while i < count:
    isna = df2.updown.isna()
    if i > 1 and not isna.iloc[:-(i - 1)].any():
        break
    shift = df2.price.shift(-i)
    # only rows that are still unset may be labelled at this shift
    df2.loc[isna & (shift >= up), 'updown'] = 'Up'
    df2.loc[isna & (shift <= down), 'updown'] = 'Down'
    i += 1
t3 = time.time()
print(f'Method 2: {t3 - t2}')

# Check that both methods produce the same labels and the same unset rows.
s1 = df1.updown
s2 = df2.updown
match = (s1.isnull() == s2.isnull()).all() and (s1[s1.notnull()] == s2[s2.notnull()]).all()
print(f'Series match: {match}')
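To make the look-ahead in the second method easier to follow, here is a minimal sketch on a four-row series. The prices are made up for illustration and the variable names (`ahead1`, `hit_up_1`, and so on) are not part of the code above. It shows how `shift(-i)` lines each row up with the price `i` rows later, and why the "still unset" mask is needed.

```python
import numpy as np
import pandas as pd

# Illustrative prices only (not from the benchmark data).
price = pd.Series([100, 101, 103, 98])
up, down = price + 2, price - 2

ahead1 = price.shift(-1)  # price one row ahead:  101, 103, 98, NaN
ahead2 = price.shift(-2)  # price two rows ahead: 103, 98, NaN, NaN

hit_up_1 = ahead1 >= up      # False, True, False, False
hit_down_1 = ahead1 <= down  # False, False, True, False
hit_up_2 = ahead2 >= up      # True, False, False, False

# Apply the shifts in order; a row is only labelled while it is still unset,
# so the first threshold crossing wins.
updown = pd.Series(np.nan, index=price.index, dtype=object)
for ahead in (ahead1, ahead2):
    unset = updown.isna()
    updown[unset & (ahead >= up)] = 'Up'
    updown[unset & (ahead <= down)] = 'Down'

print(updown.tolist())
```

Row 1 also satisfies the "down" condition at a shift of 2 (98 <= 99), but it was already labelled "Up" at a shift of 1, so the mask leaves it alone. The last row never gets a label because there is no later price to compare against.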
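The main reason for the speedup is that, instead of looping over the rows in Python, we operate on whole arrays of data, and that work happens in compiled C code. Individual calls from Python into pandas or NumPy (whose core routines are compiled) are quite fast, but each one carries some overhead, and that overhead quickly becomes the limiting factor if you make a lot of them.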
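The per-call overhead can be seen in isolation with a rough sketch like the one below. It is not part of the benchmark above; the function names are hypothetical and the operation (adding 2 to every price) is chosen only because it is trivial. Both functions compute the same values, one with a Python-level call per element and one with a single array operation.

```python
import timeit

import numpy as np
import pandas as pd

# Same value range as the example data: integers from 90 to 120 inclusive.
prices = pd.Series(np.random.randint(90, 121, size=10_000))


def add_in_loop():
    # One Python-level call into pandas per element.
    return [prices.iloc[i] + 2 for i in range(len(prices))]


def add_vectorized():
    # A single call; the element-wise loop runs in compiled code.
    return prices + 2


# Both approaches give the same values.
assert (np.array(add_in_loop()) == add_vectorized().to_numpy()).all()

loop_time = timeit.timeit(add_in_loop, number=3)
vector_time = timeit.timeit(add_vectorized, number=3)
print(f'loop: {loop_time:.4f}s  vectorized: {vector_time:.4f}s')
```

The exact timings will vary by machine and pandas version, so treat the printed numbers as indicative only.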
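The size of the speedup depends on the input data, but it grows with the number of rows in the dataframe: the more rows there are, the slower the row-by-row loop is in comparison (times are in seconds; `increase` is method1 divided by method2):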
   iterations     method1   method2     increase
0         100    0.056002  0.018267     3.065689
1        1000    0.209895  0.005000    41.982070
2       10000    2.625701  0.009001   291.727054
3      100000  108.080149  0.042001  2573.260448
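A table of this shape could be produced with a small harness along the following lines. This is a sketch rather than the script used for the numbers above: `method1`, `method2`, and `benchmark` are hypothetical names that wrap the two timed blocks as functions, and `time.perf_counter` is used here instead of `time.time`.

```python
import time
from random import randint

import numpy as np
import pandas as pd


def method1(prices):
    """Row-by-row scan (hypothetical wrapper around the first timed block)."""
    df = pd.DataFrame({'price': prices})
    df['updown'] = pd.Series(np.nan, index=df.index, dtype=object)
    count = len(df)
    for i in range(count):
        up = df.price.iloc[i] + 2
        down = up - 4
        for pos in range(i + 1, count):
            value = df.price.iloc[pos]
            if value >= up:
                df.loc[i, 'updown'] = 'Up'
                break
            if value <= down:
                df.loc[i, 'updown'] = 'Down'
                break
    return df.updown


def method2(prices):
    """Shift-based scan (hypothetical wrapper around the second timed block)."""
    df = pd.DataFrame({'price': prices})
    df['updown'] = pd.Series(np.nan, index=df.index, dtype=object)
    up, down = df.price + 2, df.price - 2
    for i in range(1, len(df)):
        isna = df.updown.isna()
        # Only the first len(df) - i rows have a price i rows ahead.
        if not isna.iloc[:len(df) - i].any():
            break
        shift = df.price.shift(-i)
        df.loc[isna & (shift >= up), 'updown'] = 'Up'
        df.loc[isna & (shift <= down), 'updown'] = 'Down'
    return df.updown


def benchmark(sizes):
    """Time both methods for each row count and return one table row per size."""
    rows = []
    for n in sizes:
        prices = [randint(90, 120) for _ in range(n)]
        t0 = time.perf_counter()
        s1 = method1(prices)
        t1 = time.perf_counter()
        s2 = method2(prices)
        t2 = time.perf_counter()
        # Both methods must agree before the timings mean anything.
        assert s1.isna().equals(s2.isna())
        assert (s1.dropna() == s2.dropna()).all()
        rows.append({'iterations': n, 'method1': t1 - t0,
                     'method2': t2 - t1, 'increase': (t1 - t0) / (t2 - t1)})
    return pd.DataFrame(rows)


if __name__ == '__main__':
    print(benchmark([100, 1000]))
```

Larger sizes can be added to the list passed to `benchmark`, bearing in mind that the row-by-row method took well over a minute at 100000 rows in the table above.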