Python 3.x: Python for loop over millions of rows

Posted on September 15

I have a dataframe c with many different columns. In addition, arr is a dataframe that is a subset of c: arr = c[c['A_D'] == 'A'].

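The main idea of my code is to iterate over all rows of the c dataframe and, for each one, search the arr dataframe for all candidates that satisfy certain conditions:

Only rows with c['A_D'] == 'D' and c['Already_linked'] == 0 need to be iterated
The hour in the arr dataframe must be less than hour_aux in the c dataframe
The Already_linked column of the arr dataframe must be 0: arr.Already_linked == 0
Terminal and Operator must be the same in the c and arr dataframes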

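Right now, the conditions are implemented with boolean indexing and groupby get_group:

Group the arr dataframe so that rows with the same operator and terminal can be selected: g = groups.get_group((row.Operator, row.Terminal))
Select only the arrivals whose hour is less than hour_aux of the c row and that have Already_linked == 0: vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]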
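The two steps above can be tried in isolation on a toy arrivals table. This is only a minimal sketch: the column names follow the question, but the three rows and the 09:15 cutoff are made-up values for illustration.

```python
import datetime as dt
import pandas as pd

# Toy arrivals table; column names follow the question, values are made up.
arr = pd.DataFrame({
    "Operator": ["DL", "DL", "QR"],
    "Terminal": ["3", "3", "4"],
    "hour": [dt.time(6, 50), dt.time(9, 40), dt.time(16, 55)],
    "Already_linked": [0, 0, 1],
})

# Group once, outside any loop.
groups = arr.groupby(["Operator", "Terminal"])

# Hypothetical departure: operator DL, terminal 3, hour_aux = 09:15.
hour_aux = dt.time(9, 15)
g = groups.get_group(("DL", "3"))

# Keep unlinked arrivals whose hour is earlier than hour_aux.
vb = g[(g.Already_linked == 0) & (g.hour < hour_aux)]
print(vb.index.tolist())
```

Only the 06:50 arrival passes both filters here; the 09:40 arrival is too late and the QR row belongs to another group.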

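For each row of the c dataframe that satisfies all the conditions, a vb dataframe is created. Naturally, this dataframe has a different length in each iteration. After creating vb, my goal is to choose the index of vb that minimizes the time between vb.START and the x value of the current c row (row.x). The FlightID corresponding to that index is then stored in column a of the c dataframe. In addition, since the arrival has now been linked to a departure, the Already_linked value of that arrival in the arr dataframe is changed from 0 to 1.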
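The selection step can be illustrated on its own. The sketch below uses two of the DL arrivals from the sample data (rows 4 and 9) as a stand-in for vb, and takes x for departure row 1 as its START minus tot minutes (09:30 minus 84 minutes); a real vb would contain every arrival that passes the filters.

```python
import pandas as pd

# Stand-in for vb: two candidate arrivals, indexed as in the sample data.
vb = pd.DataFrame(
    {
        "START": pd.to_datetime(["2017-03-26 06:50:00", "2017-03-26 07:45:00"]),
        "FlightID": ["DL401", "DL402"],
    },
    index=[4, 9],
)

# x for departure row 1: START (09:30) minus tot (84 minutes).
x = pd.Timestamp("2017-03-26 09:30:00") - pd.Timedelta(84, unit="m")

# Index label of the arrival whose START is closest in time to x.
aux = (vb.START - x).abs().idxmin()
print(aux, vb.loc[aux, "FlightID"])
```

The 07:45 arrival is 21 minutes away from x while the 06:50 arrival is 76 minutes away, so row 9 (DL402) is chosen, which agrees with the desired output for row 1.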

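It is important to note that the Already_linked column of the arr dataframe may change in every iteration (arr.Already_linked == 0 is one of the conditions for creating the vb dataframe). Therefore, this code cannot be parallelized.

I am already using c.itertuples() for efficiency, but since c has millions of rows, this code is still too slow.

Another option would be to use pandas apply on every row. However, this is not straightforward, because the values of c and arr change in each iteration (and I believe that even apply would be very slow).

Is there any way to convert this for loop into a vectorized solution, or to reduce the running time by a factor of 10 (or more, if possible)?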
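One direction that might help, offered only as an untested-for-speed sketch: keep the loop sequential (the Already_linked dependency requires that) but move the per-row work from pandas objects to plain NumPy arrays. The function name link_departures is hypothetical, and the sketch assumes hour and hour_aux are the time of day of START and of START minus 15 minutes, as in the setup code of the question. It is not a vectorized solution and no speedup figure is claimed.

```python
import numpy as np
import pandas as pd

DAY_NS = 86_400 * 10**9      # nanoseconds per day
QUARTER_NS = 15 * 60 * 10**9  # 15 minutes in nanoseconds

def link_departures(c):
    """Greedy sequential matching on NumPy arrays (hypothetical helper).

    Returns (a, linked), both aligned with the rows of c.
    """
    n = len(c)
    to_ns = lambda s: s.to_numpy().astype("datetime64[ns]").astype("int64")
    start, x = to_ns(c["START"]), to_ns(c["x"])
    tod = start % DAY_NS                     # plays the role of `hour`
    tod_aux = (start - QUARTER_NS) % DAY_NS  # plays the role of `hour_aux`
    is_arr = (c["A_D"] == "A").to_numpy()
    is_dep = (c["A_D"] == "D").to_numpy()
    linked = c["Already_linked"].to_numpy().copy()
    op, term = c["Operator"].to_numpy(), c["Terminal"].to_numpy()
    flight = c["FlightID"].to_numpy()
    a = np.zeros(n, dtype=object)

    # Row positions of the arrivals, grouped by (Operator, Terminal).
    arrivals = pd.DataFrame({"Operator": op, "Terminal": term,
                             "pos": np.arange(n)})[is_arr]
    pos_by_group = {key: grp["pos"].to_numpy()
                    for key, grp in arrivals.groupby(["Operator", "Terminal"])}

    # The set of departures to visit is fixed before the loop starts.
    for i in np.flatnonzero(is_dep & (linked == 0)):
        cand = pos_by_group.get((op[i], term[i]))
        if cand is None:
            continue
        ok = cand[(linked[cand] == 0) & (tod[cand] < tod_aux[i])]
        if ok.size == 0:
            continue
        best = ok[np.abs(start[ok] - x[i]).argmin()]
        a[i] = flight[best]
        linked[best] = 1  # the arrival is no longer available
        linked[i] = 1     # the departure is now linked

    a[is_dep & (linked == 0)] = "No_link_found"
    return a, linked

# Sample data from the question (only the columns the sketch needs).
c = pd.DataFrame({
    "START": pd.to_datetime([
        "2017-03-26 16:55:00", "2017-03-26 09:30:00", "2017-03-27 09:30:00",
        "2017-10-08 15:15:00", "2017-03-26 06:50:00", "2017-03-27 06:50:00",
        "2017-03-29 06:50:00", "2017-05-03 06:50:00", "2017-06-25 06:50:00",
        "2017-03-26 07:45:00"]),
    "A_D": list("ADDDAAAAAA"),
    "Operator": ["QR", "DL", "DL", "VS", "DL", "DL", "DL", "DL", "DL", "DL"],
    "FlightID": ["QR001", "DL001", "DL001", "VS001", "DL401", "DL401",
                 "DL401", "DL401", "DL401", "DL402"],
    "Terminal": ["4", "3", "3", "3", "3", "3", "3", "3", "3", "3"],
    "tot": [70, 84, 78, 45, 9, 19, 3, 32, 95, 58],
    "Already_linked": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})
c["x"] = c.START - pd.to_timedelta(c.tot, unit="m")
c["a"], c["Already_linked"] = link_departures(c)
print(c[["A_D", "Operator", "a", "Already_linked"]])
```

On the ten sample rows this reproduces the a and Already_linked columns of the desired output. Whether it comes close to a tenfold improvement on millions of rows would have to be measured, since each departure still scans every arrival in its group.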

Initial dataframe:

    START                END                  A_D  Operator  FlightID  Terminal  TROUND_ID   tot
0   2017-03-26 16:55:00  2017-10-28 16:55:00  A    QR        QR001     4         QR002       70
1   2017-03-26 09:30:00  2017-06-11 09:30:00  D    DL        DL001     3         "        "  84
2   2017-03-27 09:30:00  2017-10-28 09:30:00  D    DL        DL001     3         "        "  78
3   2017-10-08 15:15:00  2017-10-22 15:15:00  D    VS        VS001     3         "        "  45
4   2017-03-26 06:50:00  2017-06-11 06:50:00  A    DL        DL401     3         "        "  9
5   2017-03-27 06:50:00  2017-10-28 06:50:00  A    DL        DL401     3         "        "  19
6   2017-03-29 06:50:00  2017-04-19 06:50:00  A    DL        DL401     3         "        "  3
7   2017-05-03 06:50:00  2017-10-25 06:50:00  A    DL        DL401     3         "        "  32
8   2017-06-25 06:50:00  2017-10-22 06:50:00  A    DL        DL401     3         "        "  95
9   2017-03-26 07:45:00  2017-10-28 07:45:00  A    DL        DL402     3         "        "  58

Desired output (some columns are omitted from the dataframe below; only the a and Already_linked columns are relevant):

    START                END                  A_D  Operator  a              Already_linked
0   2017-03-26 16:55:00  2017-10-28 16:55:00  A    QR        0              1
1   2017-03-26 09:30:00  2017-06-11 09:30:00  D    DL        DL402          1
2   2017-03-27 09:30:00  2017-10-28 09:30:00  D    DL        DL401          1
3   2017-10-08 15:15:00  2017-10-22 15:15:00  D    VS        No_link_found  0
4   2017-03-26 06:50:00  2017-06-11 06:50:00  A    DL        0              0
5   2017-03-27 06:50:00  2017-10-28 06:50:00  A    DL        0              1
6   2017-03-29 06:50:00  2017-04-19 06:50:00  A    DL        0              0
7   2017-05-03 06:50:00  2017-10-25 06:50:00  A    DL        0              0
8   2017-06-25 06:50:00  2017-10-22 06:50:00  A    DL        0              0
9   2017-03-26 07:45:00  2017-10-28 07:45:00  A    DL        0              1

Code:

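# Group the arrivals once so each departure only looks at its own (Operator, Terminal) group
groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == "D") & (c.Already_linked == 0)].itertuples():
    try:
        g = groups.get_group((row.Operator, row.Terminal))
        # Unlinked arrivals whose hour is earlier than the departure's hour_aux
        vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]
        # Arrival whose START is closest to row.x
        aux = (vb.START - row.x).abs().idxmin()
        c.loc[row.Index, 'a'] = vb.loc[aux].FlightID
        arr.loc[aux, 'Already_linked'] = 1
    except (KeyError, ValueError):
        # KeyError: no arrivals in this group; ValueError: vb is empty
        continue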

c['Already_linked'] = np.where((c.a != 0) & (c.a != 'No_link_found') & (c.A_D == 'D'), 1, c['Already_linked'])
c.loc[arr.index, 'Already_linked'] = arr.Already_linked
c['a'] = np.where((c.Already_linked  == 0) & (c.A_D == 'D'),'No_link_found',c['a'])

Code to build the initial c dataframe:

import numpy as np
import pandas as pd
import io

s = '''
 A_D     Operator     FlightID    Terminal   TROUND_ID   tot
 A   QR  QR001   4   QR002       70
 D   DL  DL001   3   "        "  84
 D   DL  DL001   3   "        "  78
 D   VS  VS001   3   "        "  45
 A   DL  DL401   3   "        "  9
 A   DL  DL401   3   "        "  19
 A   DL  DL401   3   "        "  3
 A   DL  DL401   3   "        "  32
 A   DL  DL401   3   "        "  95
 A   DL  DL402   3   "        "  58
'''

data_aux = pd.read_table(io.StringIO(s), sep=r'\s+')
data_aux.Terminal = data_aux.Terminal.astype(str)
data_aux.tot= data_aux.tot.astype(str)

d = {'START': ['2017-03-26 16:55:00', '2017-03-26 09:30:00','2017-03-27 09:30:00','2017-10-08 15:15:00',
           '2017-03-26 06:50:00','2017-03-27 06:50:00','2017-03-29 06:50:00','2017-05-03 06:50:00',
           '2017-06-25 06:50:00','2017-03-26 07:45:00'], 'END': ['2017-10-28 16:55:00' ,'2017-06-11 09:30:00' ,
           '2017-10-28 09:30:00' ,'2017-10-22 15:15:00','2017-06-11 06:50:00' ,'2017-10-28 06:50:00', 
           '2017-04-19 06:50:00' ,'2017-10-25 06:50:00','2017-10-22 06:50:00' ,'2017-10-28 07:45:00']}    

aux_df = pd.DataFrame(data=d)
aux_df.START = pd.to_datetime(aux_df.START)
aux_df.END = pd.to_datetime(aux_df.END)
c = pd.concat([aux_df, data_aux], axis = 1)
c['A_D'] = c['A_D'].astype(str)
c['Operator'] = c['Operator'].astype(str)
c['Terminal'] = c['Terminal'].astype(str)

# START is already datetime64, so the time of day can be taken directly
c['hour'] = c['START'].dt.time
c['hour_aux'] = (c['START'] - pd.Timedelta(15, unit='m')).dt.time
c['start_day'] = c['START'].astype(str).str[0:10]
c['end_day'] = c['END'].astype(str).str[0:10]
c['x'] = c.START -  pd.to_timedelta(c.tot.astype(int), unit='m')
c["a"] = 0
c["Already_linked"] = np.where(c.TROUND_ID != "        ", 1 ,0)

# Take an explicit copy so that later writes to arr.loc do not target a view of c
arr = c[c['A_D'] == 'A'].copy()
