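I need to group by application ID and then compare two timestamps across separate rows within each group, creating a boolean Series "USER_REJECTS". USER_REJECTS should indicate whether a row's REJECTED_TIME equals the SELECTED_TIME of any other row in the same group.
The dataset has millions of rows and 100+ variables, and each application_id has 15-25 rows, so efficiency matters.
Sample data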
application_id | id | creation_timestamp | selected_time | rejected_time |
---|---|---|---|---|
69c0 | 7 | 2023-11-20 05:32:26.691008 | 2023-11-20 05:32:26.691008 | 2023-11-21 08:30:20.881008 |
69c0 | 15 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-20 05:32:26.691008 |
69c0 | 14 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-04 05:32:26.691008 |
69c0 | 9 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-20 05:32:26.691010 |
69c0 | 18 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-20 05:32:26.691011 |
69c0 | 6 | 2023-11-20 05:32:26.691008 | 2023-11-21 08:30:20.881008 | NaT |
69c0 | 19 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-11 05:32:26.691008 |
db26 | 11 | 2023-08-01 10:40:48.473828 | 2023-08-01 10:40:48.473828 | |
db26 | 12 | 2023-08-01 10:40:48.473828 | 2023-08-01 10:40:48.473828 | |
Expected output
application_id | id | creation_timestamp | selected_time | rejected_time | user_rejects |
---|---|---|---|---|---|
69c0 | 7 | 2023-11-20 05:32:26.691008 | 2023-11-20 05:32:26.691008 | 2023-11-21 08:30:20.881008 | 1 |
69c0 | 15 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-20 05:32:26.691008 | 0 |
69c0 | 14 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-04 05:32:26.691008 | 0 |
69c0 | 9 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-20 05:32:26.691010 | 0 |
69c0 | 18 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-20 05:32:26.691011 | 0 |
69c0 | 6 | 2023-11-20 05:32:26.691008 | 2023-11-21 08:30:20.881008 | NaT | 0 |
69c0 | 19 | 2023-11-20 05:32:26.691008 | NaT | 2023-12-11 05:32:26.691008 | 0 |
db26 | 11 | 2023-08-01 10:40:48.473828 | 2023-08-01 10:40:48.473828 | | 0 |
db26 | 12 | 2023-08-01 10:40:48.473828 | 2023-08-01 10:40:48.473828 | | 0 |
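
For reference, here is a minimal sketch of one vectorized way to get this result without a per-group Python loop. It assumes `selected_time` and `rejected_time` are already datetime64 columns, and that "any other row" means a row never matches its own `selected_time`. The names `add_user_rejects` and `n_selected` are hypothetical helpers made up for illustration, and the sample frame is a shortened version of the data above.

```python
import pandas as pd


def add_user_rejects(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows whose rejected_time equals the selected_time of
    another row with the same application_id (hypothetical helper)."""
    keys = ["application_id", "rejected_time"]

    # Count how many rows carry each (application_id, selected_time) pair.
    # Rows without a selected_time are dropped, so NaT never matches NaT.
    selected_counts = (
        df.dropna(subset=["selected_time"])
        .groupby(["application_id", "selected_time"])
        .size()
        .rename("n_selected")
        .reset_index()
        .rename(columns={"selected_time": "rejected_time"})
    )

    # Many-to-one left merge: keeps the row order and row count of df.
    merged = df[keys].merge(
        selected_counts, on=keys, how="left", validate="many_to_one"
    )
    n_selected = merged["n_selected"].fillna(0).to_numpy()

    # Assumption: a row must not match itself, so subtract its own hit.
    # NaT == NaT evaluates to False, so missing values never count here.
    self_match = (df["selected_time"] == df["rejected_time"]).to_numpy().astype(int)

    out = df.copy()
    out["user_rejects"] = ((n_selected - self_match) > 0).astype(int)
    return out


# Shortened sample: ids 7, 15 and 6 from 69c0, plus both db26 rows.
sample = pd.DataFrame(
    {
        "application_id": ["69c0", "69c0", "69c0", "db26", "db26"],
        "id": [7, 15, 6, 11, 12],
        "selected_time": pd.to_datetime(
            [
                "2023-11-20 05:32:26.691008",
                None,
                "2023-11-21 08:30:20.881008",
                "2023-08-01 10:40:48.473828",
                "2023-08-01 10:40:48.473828",
            ]
        ),
        "rejected_time": pd.to_datetime(
            ["2023-11-21 08:30:20.881008", "2023-12-20 05:32:26.691008", None, None, None]
        ),
    }
)

result = add_user_rejects(sample)
print(result["user_rejects"].tolist())  # [1, 0, 0, 0, 0]
```

The merge touches only the three key columns, so the 100+ other variables are not copied during the lookup. Whether this is faster than a `groupby(...).transform(...)` approach on the real data is something that would need to be measured.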