Python: removing duplicate sub-DataFrames from a pandas DataFrame

Posted on February 29

For example, I have a pandas DataFrame:

import pandas as pd

df_dupl = pd.DataFrame({
    'EVENT_TIME': ['00:01', '00:01', '00:01', '00:03', '00:03', '00:03', '00:06', '00:06', '00:06', '00:08', '00:08', '00:10', '00:10', '00:11', '00:11', '00:13', '00:13', '00:13'],
    'UNIQUE_ID': [123, 123, 123, 125, 125, 125, 123, 123, 123, 127, 127, 123, 123, 123, 123, 123, 123, 123],
    'Value1': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'B', 'C', 'B', 'A', 'B', 'A'],
    'Value2': [0.3, 0.2, 0.2, 0.1, 1.3, 0.2, 0.3, 0.2, 0.2, 0.1, 1.3, 0.3, 0.2, 0.3, 0.2, 0.3, 0.2, 0.2]
})

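I want to remove every sequence of rows whose values are identical to those of the previous rows (ordered by EVENT_TIME) with the same UNIQUE_ID. For the example above, the result should look like this: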

df = pd.DataFrame({
    'EVENT_TIME': ['00:01', '00:01', '00:01', '00:03', '00:03', '00:03', '00:08', '00:08', '00:10', '00:10', '00:11', '00:11', '00:13', '00:13', '00:13'],
    'UNIQUE_ID': [123, 123, 123, 125, 125, 125, 127, 127, 123, 123, 123, 123, 123, 123, 123],
    'Value1': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'B', 'C', 'B', 'A', 'B', 'A'],
    'Value2': [0.3, 0.2, 0.2, 0.1, 1.3, 0.2, 0.1, 1.3, 0.3, 0.2, 0.3, 0.2, 0.3, 0.2, 0.2]
})

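The rows with time 00:06 should be removed, because the previous sub-DataFrame with UNIQUE_ID 123 (time 00:01) is identical. On the other hand, the rows with time 00:13 should be kept: they are also identical to the rows with time 00:01, but there are other rows with UNIQUE_ID 123 in between. The key point is that I want to compare whole sub-DataFrames, not single rows.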
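As a minimal illustration of what "comparing whole sub-DataFrames" means here, the sketch below compares the value columns of two blocks with `DataFrame.equals` after resetting the index. The `sample` frame and the `block` helper are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical cut-down sample: three blocks for a single UNIQUE_ID.
sample = pd.DataFrame({
    'EVENT_TIME': ['00:01', '00:01', '00:01', '00:06', '00:06', '00:06', '00:10', '00:10'],
    'UNIQUE_ID': [123] * 8,
    'Value1': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B'],
    'Value2': [0.3, 0.2, 0.2, 0.3, 0.2, 0.2, 0.3, 0.2],
})

def block(frame, time):
    """Return the value columns of one EVENT_TIME block with a fresh index."""
    rows = frame.loc[frame['EVENT_TIME'] == time, ['Value1', 'Value2']]
    return rows.reset_index(drop=True)

# Same rows in the same order: the blocks are equal as a whole.
same_block = block(sample, '00:01').equals(block(sample, '00:06'))

# The 00:10 block has only two rows, so it differs even though
# each of its rows matches a row of the 00:01 block.
shorter_block = block(sample, '00:01').equals(block(sample, '00:10'))

print(same_block, shorter_block)
```

Resetting the index matters because `equals` also compares index labels, and the two blocks come from different positions in the frame.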

I can get the desired result with the following function, but it is very slow.

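def del_dupl_gr(df):
    out = []
    # handle each UNIQUE_ID separately
    for x in df['UNIQUE_ID'].unique():
        prev_df = pd.DataFrame()  # last block kept for this UNIQUE_ID
        for y in df[df['UNIQUE_ID'] == x]['EVENT_TIME'].unique():
            # the block of rows sharing this UNIQUE_ID and EVENT_TIME
            test_df = df[(df['UNIQUE_ID'] == x) & (df['EVENT_TIME'] == y)]
            # compare only the value columns, ignoring the original index
            if not test_df.iloc[:, 2:].reset_index(drop=True).equals(prev_df.iloc[:, 2:].reset_index(drop=True)):
                out.append(test_df)
                prev_df = test_df
    # restore the original row order
    return pd.concat(out).sort_index().reset_index(drop=True)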

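The real DataFrame is quite large (more than a million rows), and this loop takes a long time. I am sure there must be a proper (or at least faster) way to do this.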

Results

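Thanks for all the submitted answers. I compared their speed. In some cases I slightly edited the methods so that they produce exactly the same results: in every method that uses sort_values I added kind='stable' to make sure the order is preserved, and I appended .reset_index(drop=True) at the end.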

Method	1000 rows	10 000 rows	100 000 rows
original	556 ms	5.41 s	Not tested
mozway	1.24 s	10.1 s	Not tested
Andrej Kesely	696 ms	4.56 s	Not tested
Quang Hoang	11.3 ms	34.1 ms	318 ms
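The adjustment described above (a stable sort followed by an index reset) can be sketched on a toy frame. The `toy` frame and its column names are assumptions made up for this illustration; `kind='stable'` and `reset_index(drop=True)` are the options named in the text.

```python
import pandas as pd

# Hypothetical toy frame with tied sort keys.
toy = pd.DataFrame({
    'key': [2, 1, 2, 1],
    'tag': ['a', 'b', 'c', 'd'],
})

# A stable sort keeps tied rows in their original relative order,
# and reset_index(drop=True) replaces the shuffled labels with 0..n-1.
ordered = toy.sort_values('key', kind='stable').reset_index(drop=True)

print(ordered)
```

With a stable sort, the rows tagged 'b' and 'd' (key 1) stay in that order, followed by 'a' and 'c' (key 2), which is what makes the outputs of different methods directly comparable.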

Recommended answer

# the value columns
value_cols = df_dupl.columns[2:]

# groups are identified by `EVENT_TIME` and `UNIQUE_ID`
groupby = df_dupl.groupby(['EVENT_TIME', 'UNIQUE_ID'])['Value1']

# these are the groups
groups = groupby.ngroup()

# enumeration within the groups
enums = groupby.cumcount()

# sizes of the groups - populated across the rows
sizes = groupby.transform('size')

dup = (df_dupl.groupby(['UNIQUE_ID', enums])[value_cols].shift()     # shift by enumeration within `UNIQUE_ID`
         .eq(df_dupl[value_cols]).all(axis=1)                        # equal to the current rows
         .groupby(groups).transform('all')                           # identical across the whole group
       & sizes.groupby([df_dupl['UNIQUE_ID'], enums]).diff().eq(0)   # and the group sizes are equal too
      )

# output
df_dupl.loc[~dup]

   EVENT_TIME  UNIQUE_ID Value1  Value2
0       00:01        123      A     0.3
1       00:01        123      B     0.2
2       00:01        123      A     0.2
3       00:03        125      A     0.1
4       00:03        125      B     1.3
5       00:03        125      A     0.2
9       00:08        127      A     0.1
10      00:08        127      B     1.3
11      00:10        123      A     0.3
12      00:10        123      B     0.2
13      00:11        123      C     0.3
14      00:11        123      B     0.2
15      00:13        123      A     0.3
16      00:13        123      B     0.2
17      00:13        123      A     0.2
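For cross-checking results on small inputs, the rule from the question can also be written as a single pass that builds one hashable signature per (UNIQUE_ID, EVENT_TIME) block and compares it with the last signature seen for the same UNIQUE_ID. This is only a sketch, not one of the benchmarked methods: the function name `drop_repeated_blocks` is hypothetical, it assumes the frame is already ordered by EVENT_TIME as in the example, and its speed has not been measured.

```python
import pandas as pd

def drop_repeated_blocks(frame, id_col='UNIQUE_ID', time_col='EVENT_TIME'):
    """Drop blocks identical to the previous block of the same id.

    Assumes `frame` is already ordered by `time_col` (an assumption,
    matching the example data).
    """
    value_cols = [c for c in frame.columns if c not in (id_col, time_col)]
    ids = frame[id_col].tolist()
    times = frame[time_col].tolist()
    rows = list(frame[value_cols].itertuples(index=False, name=None))

    # Collect the rows of every (id, time) block in order of appearance.
    blocks = {}
    for i, t, r in zip(ids, times, rows):
        blocks.setdefault((i, t), []).append(r)

    # Keep a block only if it differs from the previous block of its id.
    last_seen = {}
    keep = set()
    for (i, t), block_rows in blocks.items():
        if last_seen.get(i) != block_rows:
            keep.add((i, t))
        last_seen[i] = block_rows

    mask = [(i, t) in keep for i, t in zip(ids, times)]
    return frame[mask].reset_index(drop=True)


df_dupl = pd.DataFrame({
    'EVENT_TIME': ['00:01', '00:01', '00:01', '00:03', '00:03', '00:03', '00:06', '00:06', '00:06', '00:08', '00:08', '00:10', '00:10', '00:11', '00:11', '00:13', '00:13', '00:13'],
    'UNIQUE_ID': [123, 123, 123, 125, 125, 125, 123, 123, 123, 127, 127, 123, 123, 123, 123, 123, 123, 123],
    'Value1': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'B', 'C', 'B', 'A', 'B', 'A'],
    'Value2': [0.3, 0.2, 0.2, 0.1, 1.3, 0.2, 0.3, 0.2, 0.2, 0.1, 1.3, 0.3, 0.2, 0.3, 0.2, 0.3, 0.2, 0.2],
})

result = drop_repeated_blocks(df_dupl)
print(result)
```

On the example data only the 00:06 block is dropped. The 00:13 block is kept because the last block seen for UNIQUE_ID 123 at that point is the 00:11 one.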
