I have a rather wide dataset (700k rows and 100+ columns) with multiple entity_id values, each with multiple datetime intervals.
There are many attr columns holding different values.
I am trying to cut those intervals at specific dates (specific_dt) for each entity_id.
When an interval is split, the newly created intervals inherit their parent's attr values.

Here is a small reproducible example:

import pandas as pd

have = {'entity_id': [1, 1, 2, 2],
        'start_date': ['2014-12-01 00:00:00', '2015-03-01 00:00:00', '2018-02-12 00:00:00', '2019-02-01 00:00:00'],
        'end_date': ['2015-02-28 23:59:59', '2015-05-31 23:59:59', '2019-01-31 23:59:59', '2023-05-28 23:59:59'],
        'attr1': ['A', 'B', 'D', 'J']}
have = pd.DataFrame(data=have)
have

   entity_id           start_date             end_date attr1
0          1  2014-12-01 00:00:00  2015-02-28 23:59:59     A
1          1  2015-03-01 00:00:00  2015-05-31 23:59:59     B
2          2  2018-02-12 00:00:00  2019-01-31 23:59:59     D
3          2  2019-02-01 00:00:00  2023-05-28 23:59:59     J
# Specific dates to integrate
specific_dt = ['2015-01-01 00:00:00', '2015-03-31 00:00:00']

The expected output is as follows:

want

   entity_id start_date            end_date attr1
0          1 2014-12-01 2014-12-31 23:59:59     A
0          1 2015-01-01 2015-02-28 23:59:59     A
1          1 2015-03-01 2015-03-30 23:59:59     B
1          1 2015-03-31 2015-05-31 23:59:59     B
2          2 2018-02-12 2019-01-31 23:59:59     D
3          2 2019-02-01 2023-05-28 23:59:59     J

I have been able to get the desired output with the following code:

# Create a list to store the new rows
new_rows = []

# Iterate through each row in the initial DataFrame
for index, row in have.iterrows():
    id_val = row['entity_id']
    start_date = pd.to_datetime(row['start_date'])
    end_date = pd.to_datetime(row['end_date'], errors = 'coerce')
    
    # Iterate through specific dates and create new rows
    for date in specific_dt:
        specific_date = pd.to_datetime(date)
        
        # Check if the specific date is within the interval
        if start_date < specific_date < end_date:
            # Create a new row with all columns and append it to the list
            new_row = row.copy()
            new_row['start_date'] = start_date
            new_row['end_date'] = specific_date - pd.Timedelta(seconds=1)
            new_rows.append(new_row)
            
            # Update the start_date for the next iteration
            start_date = specific_date
    
    # Add the last part of the original interval as a new row
    new_row = row.copy()
    new_row['start_date'] = start_date
    new_row['end_date'] = end_date
    new_rows.append(new_row)

# Create a new DataFrame from the list of new rows
want = pd.DataFrame(data=new_rows)

However, it is extremely slow (10+ minutes) on my working dataset. Is it possible to optimize it (perhaps by eliminating the for loops)?


For reference, I am able to do this within seconds in SAS using a simple data step (the example below integrates only one of the two specific dates).

data want;
    set have;
    by entity_id start_date end_date;

    if start_date < "31MAR2015"d < end_date then
        do;
            retain _start _end;
            _start = start_date;
            _end = end_date;
            end_date = "30MAR2015"d;
            output;
            start_date = "31MAR2015"d;
            end_date = _end;
            output;
        end;
    else output;
    drop _start _end;
run;

Recommended answer

You can try this:

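have["start_date"] = pd.to_datetime(have["start_date"])
have["end_date"] = pd.to_datetime(have["end_date"])

specific_dt = [
    pd.to_datetime("2015-01-01 00:00:00"),
    pd.to_datetime("2015-03-31 00:00:00"),
]

# Work on a copy so that `have` keeps its original intervals
want = have.copy()
for dt in specific_dt:
    # Rows whose interval strictly contains the cut date
    mask = (want["start_date"] < dt) & (want["end_date"] > dt)
    # Copy the matching rows: they become the second half of each split
    # (the explicit copy avoids a SettingWithCopyWarning)
    new_df = want.loc[mask].copy()
    new_df["start_date"] = dt
    # The existing rows become the first half, ending one second before dt
    want.loc[mask, "end_date"] = dt - pd.Timedelta(seconds=1)
    # Add the new halves back so that later dates can split them as well
    want = pd.concat([want, new_df])

# Sort on start_date rather than attr1 so that the pieces stay in time order
want = want.sort_values(["entity_id", "start_date"])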
   entity_id start_date            end_date attr1
0          1 2014-12-01 2014-12-31 23:59:59     A
0          1 2015-01-01 2015-02-28 23:59:59     A
1          1 2015-03-01 2015-03-30 23:59:59     B
1          1 2015-03-31 2015-05-31 23:59:59     B
2          2 2018-02-12 2019-01-31 23:59:59     D
3          2 2019-02-01 2023-05-28 23:59:59     J
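
If looping over the dates is still too slow, or if a single interval can contain several of the specific dates, the split can probably be done without any Python loop. Below is a minimal sketch of that idea, not something I have timed on a 700k-row frame. It assumes start_date and end_date are already tz-naive datetime64 columns; split_intervals and want_fast are hypothetical names introduced here, not part of pandas. It uses np.searchsorted to count the cut dates that fall strictly inside each interval, repeats each row once per resulting piece, and then fills in the piece boundaries.

```python
import numpy as np
import pandas as pd


def split_intervals(df, cut_dates, start_col="start_date", end_col="end_date"):
    """Split each interval at every cut date strictly inside it.

    Hypothetical helper (not a pandas API). Assumes tz-naive datetime64 columns.
    """
    cuts = np.sort(pd.to_datetime(cut_dates).to_numpy())
    if len(cuts) == 0:
        return df.copy()

    start = df[start_col].to_numpy()
    end = df[end_col].to_numpy()

    # The cut dates strictly inside a row are cuts[lo:hi]
    lo = np.searchsorted(cuts, start, side="right")  # first cut > start
    hi = np.searchsorted(cuts, end, side="left")     # first cut >= end
    # Rows with a missing end date are left unsplit, as in the question's loop
    hi = np.where(np.isnat(end), lo, hi)
    n_pieces = np.maximum(hi - lo, 0) + 1

    row_pos = np.repeat(np.arange(len(df)), n_pieces)  # source row of each piece
    first = np.cumsum(n_pieces) - n_pieces             # position of each row's first piece
    k = np.arange(len(row_pos)) - first[row_pos]       # piece number within its row

    last_cut = len(cuts) - 1
    # Piece 0 keeps the original start; piece k > 0 starts at the k-th inner cut
    start_cut = cuts[np.clip(lo[row_pos] + k - 1, 0, last_cut)]
    new_start = np.where(k == 0, start[row_pos], start_cut)
    # The last piece keeps the original end; the others end 1 second before the next cut
    end_cut = cuts[np.clip(lo[row_pos] + k, 0, last_cut)] - np.timedelta64(1, "s")
    new_end = np.where(k == n_pieces[row_pos] - 1, end[row_pos], end_cut)

    # Every other column is inherited from the parent row
    out = df.iloc[row_pos].copy()
    out[start_col] = new_start
    out[end_col] = new_end
    return out


have = pd.DataFrame({
    "entity_id": [1, 1, 2, 2],
    "start_date": pd.to_datetime(["2014-12-01 00:00:00", "2015-03-01 00:00:00",
                                  "2018-02-12 00:00:00", "2019-02-01 00:00:00"]),
    "end_date": pd.to_datetime(["2015-02-28 23:59:59", "2015-05-31 23:59:59",
                                "2019-01-31 23:59:59", "2023-05-28 23:59:59"]),
    "attr1": ["A", "B", "D", "J"],
})
want_fast = split_intervals(have, ["2015-01-01 00:00:00", "2015-03-31 00:00:00"])
print(want_fast)
```

The rows come out in the original row order with the pieces of each interval in time order, so no final sort is needed as long as the input is already ordered. The cut dates are sorted inside the function, so they can be passed in any order.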
