我有一些包含多列的时间序列数据:

         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2021-08-25     desk       IL  opera  type-s    100
5 2022-09-16     desk       IL  opera  type-s     30
6 2022-09-16     desk       NY  opera  type-s      0
7 2022-11-01     desk       NJ  opera  type-s      0
8 2022-11-01     desk       IL  opera  type-s      0

我想删除忽略日期的连续重复项,例如

...
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2021-08-25     desk       IL  opera  type-s    100
...

索引为2、4的行是重复的,我只想保留第一行.

我试了两种方法,

  1. 使用也丢弃非连续副本的drop_duplicates, 和
  2. 使用shift() != ,我找到了in this answer个,似乎也在做同样的事情.

以下是我的测试代码:

import pandas as pd
from io import StringIO

data = """
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2021-08-25,desk,IL,opera,type-s,100
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""

df = pd.read_csv(StringIO(data), parse_dates=['date'])

print('\nOriginal \n', df.to_string())

columns = ['hardware', 'location', 'group', 'rtype']


dedup_cols = columns + ['value']

df = df.drop_duplicates(
    subset=dedup_cols, keep='first').reset_index(drop=True)

de_dup = df.loc[(df[dedup_cols].shift() != df[dedup_cols]).any(axis=1)]

print('\nDrop_duplicates: \n', df.to_string())
print('\nShift_filter: \n', de_dup.to_string())

expected = """
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""

expected_df = pd.read_csv(StringIO(expected), parse_dates=['date'])

print('Equal drop_duplicates ? ', expected_df.equals(df))
print('Equal shift_filter ? ', expected_df.equals(de_dup))
print('Same result from the two techniques? ', de_dup.equals(df))
print('\nExpected:\n', expected_df.to_string())

-你在做什么?

Original 
         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2021-08-25     desk       IL  opera  type-s    100
5 2022-09-16     desk       IL  opera  type-s     30
6 2022-09-16     desk       NY  opera  type-s      0
7 2022-11-01     desk       NJ  opera  type-s      0
8 2022-11-01     desk       IL  opera  type-s      0

Drop_duplicates: 
         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2022-09-16     desk       IL  opera  type-s     30
5 2022-11-01     desk       NJ  opera  type-s      0
6 2022-11-01     desk       IL  opera  type-s      0

Shift_filter: 
         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2022-09-16     desk       IL  opera  type-s     30
5 2022-11-01     desk       NJ  opera  type-s      0
6 2022-11-01     desk       IL  opera  type-s      0
Equal drop_duplicates ?  False
Equal shift_filter ?  False
Same result from the two techniques?  True

Expected:
         date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
4 2022-09-16     desk       IL  opera  type-s     30
5 2022-09-16     desk       NY  opera  type-s      0
6 2022-11-01     desk       NJ  opera  type-s      0
7 2022-11-01     desk       IL  opera  type-s      0

我认为drop_duplicates的方法是不合适的.

我希望第二种方法能行得通,但我不完全明白为什么行不通.

我在想一些group_by+shift的方法,然后取一个连续的差值,然后在差值为0的地方下降.有什么主意吗?

推荐答案

我想您想要考虑连续值per group.

然后使用groupby.shiftboolean indexing:

cols = ['hardware', 'location', 'group', 'rtype']
out = df.loc[df['value'].ne(df.groupby(cols)['value'].shift())]

输出:

        date hardware location  group   rtype  value
0 2021-03-01     desk       NY  opera  type-s      0
1 2021-03-01     desk       NJ  opera  type-s    200
2 2021-03-01     desk       IL  opera  type-s    100
3 2022-08-01     desk       NY  opera  type-s    275
5 2022-09-16     desk       IL  opera  type-s     30
6 2022-09-16     desk       NY  opera  type-s      0
7 2022-11-01     desk       NJ  opera  type-s      0
8 2022-11-01     desk       IL  opera  type-s      0

Python相关问答推荐

当密钥是复合且唯一时,Pandas合并抱怨标签不唯一

TARete错误:类型对象任务没有属性模型'

由于NEP 50,向uint 8添加-256的代码是否会在numpy 2中失败?

Django mysql图标不适用于小 case

图像 pyramid .难以创建所需的合成图像

使用groupby Pandas的一些操作

从groupby执行计算后创建新的子框架

PyQt5,如何使每个对象的 colored颜色 不同?'

avxspan与pandas period_range

在极性中创建条件累积和

ThreadPoolExecutor和单个线程的超时

有没有一种ONE—LINER的方法给一个框架的每一行一个由整数和字符串组成的唯一id?

提取相关行的最快方法—pandas

通过ManyToMany字段与Through在Django Admin中过滤

如何使用Numpy. stracards重新编写滚动和?

如果包含特定值,则筛选Groupby

如果有2个或3个,则从pandas列中删除空格

如何使用matplotlib查看并列直方图

应用指定的规则构建数组

我如何为测试函数的参数化提供fixture 生成的数据?如果我可以的话,还有其他 Select 吗?