Speeding up a Python loop

Posted on April 12

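I wrote a for loop that takes a value from a column and looks ahead to check whether that value is exceeded once or twice in the later data. The code works, but the dataset it runs over is very large, so it becomes very slow. I suspect the main cause is that every iteration recounts how many times the value is exceeded (over roughly 500,000 rows). Is there a way to speed this up?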

import pandas as pd

df1 = pd.DataFrame({'index': [0,1,2,3,4], 'Time': ['2022-01-01','2022-01-02','2022-01-03','2022-01-04','2022-01-05'], 'A':[234,456,323,576,234], 'B': [0,1,0,1,0], 'B.v': [0,234,0,323,0], 'in' : [0,0,0,0,0], 'out':[0,0,0,0,0]})


def calc(df1):

    df2 = pd.DataFrame(df1[df1['B'] ==  1])

    for x in range(len(df2)):
        index = df2.iloc[x, df2.columns.get_loc('index')]
        tvalue = df2.iloc[x, df2.columns.get_loc('A')]
        pointvalue = df2.iloc[x, df2.columns.get_loc('B.v')]
        postrates = df1['A'].values[range(index,len(df1))]

        if sum(pointvalue > postrates) == 1:
            df1.iloc[index, df1.columns.get_loc('in')] = 1
        if sum(pointvalue > postrates) >= 2:
            df1.iloc[index, df1.columns.get_loc('in')] = 2

        if sum(tvalue < postrates) == 1:
            df1.iloc[index, df1.columns.get_loc('out')] = 1
        if sum(tvalue < postrates) >= 2:
            df1.iloc[index, df1.columns.get_loc('out')] = 2
    return df1

if __name__ == "__main__":
    print(calc(df1))

You can try numba to speed up the computation:

import numba


@numba.njit(parallel=True)
def calc_in_out(A, B, Bv, out_in, out_out):
    # each row is independent, so the outer loop can run in parallel
    for idx in numba.prange(len(B)):
        val_b = B[idx]

        if val_b != 1:
            continue

        val_a = A[idx]
        val_Bv = Bv[idx]

        s1, s2 = 0, 0
        for idx2 in range(idx, len(B)):
            s1 += val_Bv > A[idx2]
            s2 += val_a < A[idx2]

            # no need to count further:
            if s1 >= 2 and s2 >= 2:
                break

        if s1 == 1:
            out_in[idx] = 1
        elif s1 >= 2:
            out_in[idx] = 2

        if s2 == 1:
            out_out[idx] = 1
        elif s2 >= 2:
            out_out[idx] = 2
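The early exit in the inner loop above is what keeps the work per row small: only the outcomes 0, 1, or "2 or more" matter, so counting can stop at 2. Below is a minimal pure-Python sketch of that idea, with no numba required. The names `capped_count` and `calc_in_out_reference` are hypothetical helpers introduced here for illustration, not part of the question or the answer; a slow reference like this may be useful for checking the compiled version on small inputs.

```python
def capped_count(values, start, predicate, cap=2):
    """Count items in values[start:] that satisfy predicate, stopping at cap."""
    count = 0
    for i in range(start, len(values)):
        if predicate(values[i]):
            count += 1
            if count >= cap:
                break  # further matches cannot change the result
    return count


def calc_in_out_reference(A, B, Bv):
    """Plain-list reference for the in/out flags (0, 1, or 2 per row)."""
    n = len(A)
    out_in = [0] * n
    out_out = [0] * n
    for idx in range(n):
        if B[idx] != 1:
            continue
        point_value = Bv[idx]
        t_value = A[idx]
        # with cap=2 the capped count is already the flag value
        out_in[idx] = capped_count(A, idx, lambda a: point_value > a)
        out_out[idx] = capped_count(A, idx, lambda a: t_value < a)
    return out_in, out_out


if __name__ == "__main__":
    # sample data from the question
    A = [234, 456, 323, 576, 234]
    B = [0, 1, 0, 1, 0]
    Bv = [0, 234, 0, 323, 0]
    print(calc_in_out_reference(A, B, Bv))
```

Unlike the numba version, this sketch stops each count independently instead of waiting for both counters to reach 2, which gives the same flags.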

A quick benchmark with 500,000 rows:

from timeit import timeit

import numba
import numpy as np
import pandas as pd  # needed for pd.DataFrame below


@numba.njit(parallel=True)
def calc_in_out(A, B, Bv, out_in, out_out):
    for idx in numba.prange(len(B)):
        val_b = B[idx]

        if val_b != 1:
            continue

        val_a = A[idx]
        val_Bv = Bv[idx]

        s1, s2 = 0, 0
        for idx2 in range(idx, len(B)):
            s1 += val_Bv > A[idx2]
            s2 += val_a < A[idx2]

            # no need to count further:
            if s1 >= 2 and s2 >= 2:
                break

        if s1 == 1:
            out_in[idx] = 1
        elif s1 >= 2:
            out_in[idx] = 2

        if s2 == 1:
            out_out[idx] = 1
        elif s2 >= 2:
            out_out[idx] = 2


def setup_df(N=500_000):
    # random test data with the same columns as the question's DataFrame
    return pd.DataFrame(
        {
            "Time": ["2022-01-01"] * N,
            "A": np.random.randint(10, 1000, size=N),
            "B": np.random.randint(0, 2, size=N),
            "B.v": np.random.randint(10, 1000, size=N),
            "in": [0] * N,
            "out": [0] * N,
        }
    )


def main():
    df1 = pd.DataFrame(
        {
            "index": [0, 1, 2, 3, 4],
            "Time": [
                "2022-01-01",
                "2022-01-02",
                "2022-01-03",
                "2022-01-04",
                "2022-01-05",
            ],
            "A": [234, 456, 323, 576, 234],
            "B": [0, 1, 0, 1, 0],
            "B.v": [0, 234, 0, 323, 0],
            "in": [0, 0, 0, 0, 0],
            "out": [0, 0, 0, 0, 0],
        }
    )

    # this will compile calc_in_out
    calc_in_out(
        df1["A"].values,
        df1["B"].values,
        df1["B.v"].values,
        df1["in"].values,
        df1["out"].values,
    )
    print(df1)

    to_run = """calc_in_out(
        df1["A"].values,
        df1["B"].values,
        df1["B.v"].values,
        df1["in"].values,
        df1["out"].values,
    )"""

    # time a single run on a fresh 500,000-row DataFrame
    t = timeit(to_run, setup="df1=setup_df()", globals=globals(), number=1)
    print(t)


if __name__ == "__main__":
    main()

