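I'm having trouble specifying a lambda function. I want something like the lambda below, but not exactly that. The code should compare each rejected_time with every paid_out_time within the group and return True if the rejected_time falls within 5 minutes after any paid_out_time.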

f = lambda x: ((x['rejected_time'].dropna() - x['paid_out_time'].dropna()).between(pd.Timedelta(0), pd.Timedelta(minutes=5)))

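Using x['paid_out_time'].min() yields a count of True values in the thousands, but removing .min() reduces that count significantly. I don't know how to compare each row's rejected_time against all paid_out_time values in the group and check whether the rejection happened 0 to 5 minutes after a paid_out_time.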
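One likely reason for the drop is that pandas aligns two Series on their index labels before subtracting, so a rejected_time is only ever compared with the paid_out_time on the same row. The sketch below illustrates this on a small made-up group (the three rows and the variable names are assumptions for illustration, not part of the original data):

```python
import pandas as pd

# Hypothetical three-row group: only the last row has a paid_out_time.
group = pd.DataFrame({
    "rejected_time": pd.to_datetime(
        ["2022-09-12 09:20:40", "2022-09-12 09:26:00", None]),
    "paid_out_time": pd.to_datetime(
        [None, None, "2022-09-12 09:20:00"]),
})

# Series - Series aligns on index labels. Labels 0 and 1 have no
# paid_out_time and label 2 has no rejected_time, so every result is NaT.
diff = group["rejected_time"].dropna() - group["paid_out_time"].dropna()
aligned_hits = diff.between(pd.Timedelta(0), pd.Timedelta(minutes=5))

# Series - scalar broadcasts the single minimum over every row instead.
scalar_diff = group["rejected_time"].dropna() - group["paid_out_time"].min()
scalar_hits = scalar_diff.between(pd.Timedelta(0), pd.Timedelta(minutes=5))

print(aligned_hits.tolist())
print(scalar_hits.tolist())
```

With alignment, no row is flagged even though the first rejection happened 40 seconds after the payout; with the scalar minimum, only that one payout is ever considered.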

I have been testing this code:

cols = ['paid_out_time', 'rejected_time']
df[cols] = df[cols].apply(pd.to_datetime, errors='coerce')

f = lambda x: ((x['rejected_time'].dropna() - x['paid_out_time'].dropna().min()).between(pd.Timedelta(0), pd.Timedelta(minutes=5)))

df['paid_out_auto_rejection'] = df.groupby('personal_id', group_keys=False).apply(f).astype(int)

Here is some test data:

personal_id  application_id  rejected_time               paid_out_time               expected
26A          1ab             2022-09-12 09:20:40.592     NaT                         1
26A          1ab             2022-08-23 07:40:03.447463  NaT                         0
26A          1ab             2022-08-02 23:16:59.545392  NaT                         1
26A          1ab             2022-08-02 23:16:59.545392  NaT                         1
26A          1ab             2022-09-12 09:20:40.592000  2022-08-02 23:16:59.545392  1
26A          1ab             2022-09-02 18:33:42.226000  NaT                         0
26A          8f0             2022-09-12 09:20:40.592000  NaT                         1
26A          8f0             2022-09-12 09:20:40.592000  NaT                         1
26A          8f0             NaT                         2022-09-12 09:20:40.592     0
26A          8f0             2022-09-12 09:21:08.604000  NaT                         1
26A          8f0             2022-09-22 08:27:45.693060  NaT                         0

Recommended answer

EDIT: For better performance, use merge_asof with the tolerance parameter:

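import pandas as pd

cols = ['paid_out_time', 'rejected_time']
df[cols] = df[cols].apply(pd.to_datetime, errors='coerce')

# reset_index keeps the original row labels in an 'index' column
df1 = df[['personal_id','application_id','rejected_time']].reset_index()
df2 = df[['personal_id','application_id','paid_out_time']]

# merge_asof needs both sides sorted by their key and free of missing keys
df3 = pd.merge_asof(df1.sort_values('rejected_time').dropna(subset=['rejected_time']),
                   df2.sort_values('paid_out_time').dropna(subset=['paid_out_time']),
                   left_on='rejected_time',
                   right_on='paid_out_time',
                   by='personal_id',
                   # 'backward' only matches a paid_out_time at or before the
                   # rejection; 'nearest' would also match payouts that come later
                   direction="backward",
                   tolerance=pd.Timedelta(minutes=5)
                   ).set_index('index')

# rows matched within the tolerance have a non-missing paid_out_time
df['new'] = (df3['paid_out_time'].notna()
                                 .reindex(df.index, fill_value=False)
                                 .astype(int))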

print (df)
   personal_id application_id              rejected_time  \
0         26A            1ab  2022-09-12 09:20:40.592000   
1         26A            1ab  2022-08-23 07:40:03.447463   
2         26A            1ab  2022-08-02 23:16:59.545392   
3         26A            1ab  2022-08-02 23:16:59.545392   
4         26A            1ab  2022-09-12 09:20:40.592000   
5         26A            1ab  2022-09-02 18:33:42.226000   
6         26A            8f0  2022-09-12 09:20:40.592000   
7         26A            8f0  2022-09-12 09:20:40.592000   
8         26A            8f0                         NaT   
9         26A            8f0  2022-09-12 09:21:08.604000   
10        26A            8f0  2022-09-22 08:27:45.693060   

                paid_out_time  new  
0                         NaT    1  
1                         NaT    0  
2                         NaT    1  
3                         NaT    1  
4  2022-08-02 23:16:59.545392    1  
5                         NaT    0  
6                         NaT    1  
7                         NaT    1  
8  2022-09-12 09:20:40.592000    0  
9                         NaT    1  
10                        NaT    0  
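To make the merge_asof approach easy to try, here is a minimal self-contained sketch that rebuilds the test data from the question and checks the result against the expected column. The column name flag and the helper names left, right and matched are assumptions made for this illustration. It uses direction="backward" so that only a payout at or before the rejection can match, which is one way to read the "within 5 minutes after" requirement.

```python
import pandas as pd

# Test data from the question (application_id omitted; it is not used).
df = pd.DataFrame({
    "personal_id": ["26A"] * 11,
    "rejected_time": pd.to_datetime([
        "2022-09-12 09:20:40.592", "2022-08-23 07:40:03.447463",
        "2022-08-02 23:16:59.545392", "2022-08-02 23:16:59.545392",
        "2022-09-12 09:20:40.592", "2022-09-02 18:33:42.226",
        "2022-09-12 09:20:40.592", "2022-09-12 09:20:40.592",
        None, "2022-09-12 09:21:08.604", "2022-09-22 08:27:45.693060",
    ]),
    "paid_out_time": pd.to_datetime(
        [None] * 4 + ["2022-08-02 23:16:59.545392"] + [None] * 3
        + ["2022-09-12 09:20:40.592"] + [None] * 2
    ),
})
expected = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]

# Left side: one row per rejection, original row label kept in "index".
left = (df[["personal_id", "rejected_time"]]
        .dropna(subset=["rejected_time"])
        .reset_index()
        .sort_values("rejected_time"))
# Right side: one row per payout.
right = (df[["personal_id", "paid_out_time"]]
         .dropna(subset=["paid_out_time"])
         .sort_values("paid_out_time"))

# For each rejection, find the latest payout at or before it, but only
# if that payout is at most 5 minutes earlier.
matched = pd.merge_asof(
    left, right,
    left_on="rejected_time", right_on="paid_out_time",
    by="personal_id",
    direction="backward",
    tolerance=pd.Timedelta(minutes=5),
).set_index("index")

# Rows without a rejected_time were dropped above; reindex restores them as 0.
df["flag"] = (matched["paid_out_time"].notna()
              .reindex(df.index, fill_value=False)
              .astype(int))

print(df["flag"].tolist() == expected)
```

Looking only at the latest earlier payout is sufficient here: if that one is more than 5 minutes old, every older payout is too.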

If you need to compare all non-missing values, use NumPy broadcasting:

import numpy as np
import pandas as pd

cols = ['paid_out_time', 'rejected_time']
df[cols] = df[cols].apply(pd.to_datetime, errors='coerce')


def f(x):
    # rows: paid_out_time values, columns: rejected_time values
    arr = (x['rejected_time'].dropna().to_numpy() -
           x['paid_out_time'].dropna().to_numpy()[:, None])
    m = (arr >= np.timedelta64(0, 'ns')) & (arr <= np.timedelta64(5, 'm'))

    # flag a rejection if any paid_out_time in the group falls in the window
    x.loc[x['rejected_time'].notna(), 'new'] = np.any(m, axis=0)
    return x

out = (df.groupby('personal_id', group_keys=False).apply(f)
          .fillna({'new':False}).astype({'new':int}))

print (out)
   personal_id application_id              rejected_time  \
0         26A            1ab  2022-09-12 09:20:40.592000   
1         26A            1ab  2022-08-23 07:40:03.447463   
2         26A            1ab  2022-08-02 23:16:59.545392   
3         26A            1ab  2022-08-02 23:16:59.545392   
4         26A            1ab  2022-09-12 09:20:40.592000   
5         26A            1ab  2022-09-02 18:33:42.226000   
6         26A            8f0  2022-09-12 09:20:40.592000   
7         26A            8f0  2022-09-12 09:20:40.592000   
8         26A            8f0                         NaT   
9         26A            8f0  2022-09-12 09:21:08.604000   
10        26A            8f0  2022-09-22 08:27:45.693060   

                paid_out_time  new  
0                         NaT    1  
1                         NaT    0  
2                         NaT    1  
3                         NaT    1  
4  2022-08-02 23:16:59.545392    1  
5                         NaT    0  
6                         NaT    1  
7                         NaT    1  
8  2022-09-12 09:20:40.592000    0  
9                         NaT    1  
10                        NaT    0  
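The broadcasting step is easier to follow on plain arrays. The sketch below is an illustration only, with three rejection times and two payout times taken from the test data and truncated to whole seconds; the names rejected, paid_out, delta, within and flags are assumptions introduced here.

```python
import numpy as np

rejected = np.array(
    ["2022-09-12T09:20:40", "2022-09-12T09:21:08", "2022-09-22T08:27:45"],
    dtype="datetime64[s]",
)
paid_out = np.array(
    ["2022-08-02T23:16:59", "2022-09-12T09:20:40"],
    dtype="datetime64[s]",
)

# paid_out[:, None] has shape (2, 1) and rejected has shape (3,), so the
# subtraction broadcasts to a (2, 3) matrix: one row per payout, one
# column per rejection.
delta = rejected - paid_out[:, None]

# True where the rejection is 0 to 5 minutes after that payout.
within = (delta >= np.timedelta64(0, "s")) & (delta <= np.timedelta64(5, "m"))

# Collapse the payout axis: a rejection is flagged if any payout qualifies.
flags = within.any(axis=0)

print(delta.shape)
print(flags.tolist())
```

The intermediate matrix holds one entry per (payout, rejection) pair in a group, so memory grows with the product of the two counts. That is the trade-off against the merge_asof approach, which avoids building all pairs.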
