I have some time-series data with multiple columns:
date hardware location group rtype value
0 2021-03-01 desk NY opera type-s 0
1 2021-03-01 desk NJ opera type-s 200
2 2021-03-01 desk IL opera type-s 100
3 2022-08-01 desk NY opera type-s 275
4 2021-08-25 desk IL opera type-s 100
5 2022-09-16 desk IL opera type-s 30
6 2022-09-16 desk NY opera type-s 0
7 2022-11-01 desk NJ opera type-s 0
8 2022-11-01 desk IL opera type-s 0
I want to drop consecutive duplicates while ignoring the date, e.g.
...
2 2021-03-01 desk IL opera type-s 100
3 2022-08-01 desk NY opera type-s 275
4 2021-08-25 desk IL opera type-s 100
...
The rows at index 2 and 4 are duplicates (ignoring the date), and I only want to keep the first one.
I have tried two approaches:
- drop_duplicates, which also discards non-consecutive duplicates, and
- shift() !=, which I found in this answer, but which appears to do the same thing.
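For reference, here is how the two techniques differ on a toy single-column example (toy data, not the table above): drop_duplicates removes every repeated value, while the shift() != pattern only removes consecutive repeats.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 1])

# drop_duplicates removes every repeated value, consecutive or not:
print(s.drop_duplicates().tolist())   # [1, 2]

# shift() != keeps a value whenever it differs from the previous row,
# so only consecutive repeats are dropped (NaN from shift() never
# compares equal, so the first row is always kept):
print(s[s != s.shift()].tolist())     # [1, 2, 1]
```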
Here is my test code:
import pandas as pd
from io import StringIO
data = """
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2021-08-25,desk,IL,opera,type-s,100
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""
df = pd.read_csv(StringIO(data), parse_dates=['date'])
print('\nOriginal \n', df.to_string())
columns = ['hardware', 'location', 'group', 'rtype']
dedup_cols = columns + ['value']
df = df.drop_duplicates(subset=dedup_cols, keep='first').reset_index(drop=True)
de_dup = df.loc[(df[dedup_cols].shift() != df[dedup_cols]).any(axis=1)]
print('\nDrop_duplicates: \n', df.to_string())
print('\nShift_filter: \n', de_dup.to_string())
expected = """
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""
expected_df = pd.read_csv(StringIO(expected), parse_dates=['date'])
print('Equal drop_duplicates ? ', expected_df.equals(df))
print('Equal shift_filter ? ', expected_df.equals(de_dup))
print('Same result from the two techniques? ', de_dup.equals(df))
print('\nExpected:\n', expected_df.to_string())
which produces the following output:
Original
date hardware location group rtype value
0 2021-03-01 desk NY opera type-s 0
1 2021-03-01 desk NJ opera type-s 200
2 2021-03-01 desk IL opera type-s 100
3 2022-08-01 desk NY opera type-s 275
4 2021-08-25 desk IL opera type-s 100
5 2022-09-16 desk IL opera type-s 30
6 2022-09-16 desk NY opera type-s 0
7 2022-11-01 desk NJ opera type-s 0
8 2022-11-01 desk IL opera type-s 0
Drop_duplicates:
date hardware location group rtype value
0 2021-03-01 desk NY opera type-s 0
1 2021-03-01 desk NJ opera type-s 200
2 2021-03-01 desk IL opera type-s 100
3 2022-08-01 desk NY opera type-s 275
4 2022-09-16 desk IL opera type-s 30
5 2022-11-01 desk NJ opera type-s 0
6 2022-11-01 desk IL opera type-s 0
Shift_filter:
date hardware location group rtype value
0 2021-03-01 desk NY opera type-s 0
1 2021-03-01 desk NJ opera type-s 200
2 2021-03-01 desk IL opera type-s 100
3 2022-08-01 desk NY opera type-s 275
4 2022-09-16 desk IL opera type-s 30
5 2022-11-01 desk NJ opera type-s 0
6 2022-11-01 desk IL opera type-s 0
Equal drop_duplicates ? False
Equal shift_filter ? False
Same result from the two techniques? True
Expected:
date hardware location group rtype value
0 2021-03-01 desk NY opera type-s 0
1 2021-03-01 desk NJ opera type-s 200
2 2021-03-01 desk IL opera type-s 100
3 2022-08-01 desk NY opera type-s 275
4 2022-09-16 desk IL opera type-s 30
5 2022-09-16 desk NY opera type-s 0
6 2022-11-01 desk NJ opera type-s 0
7 2022-11-01 desk IL opera type-s 0
I think the drop_duplicates approach is not suitable here. I expected the second method to work, but I don't fully understand why it doesn't.
I am thinking of some groupby + shift approach: take a consecutive difference within each group, then drop the rows where the difference is 0. Any ideas?
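The groupby + shift idea can be sketched as follows; this is my reading of "consecutive" as "consecutive within each (hardware, location, group, rtype) group", which is an assumption, since the IL rows at index 2 and 4 are not adjacent in the raw frame:

```python
import pandas as pd
from io import StringIO

data = """\
date,hardware,location,group,rtype,value
2021-03-01,desk,NY,opera,type-s,0
2021-03-01,desk,NJ,opera,type-s,200
2021-03-01,desk,IL,opera,type-s,100
2022-08-01,desk,NY,opera,type-s,275
2021-08-25,desk,IL,opera,type-s,100
2022-09-16,desk,IL,opera,type-s,30
2022-09-16,desk,NY,opera,type-s,0
2022-11-01,desk,NJ,opera,type-s,0
2022-11-01,desk,IL,opera,type-s,0
"""
df = pd.read_csv(StringIO(data), parse_dates=['date'])

group_cols = ['hardware', 'location', 'group', 'rtype']

# Within each group, compare each value to the previous value in that
# group; keep the row when it differs. The first row of each group gets
# NaN from shift(), which never compares equal, so it is always kept.
prev = df.groupby(group_cols)['value'].shift()
result = df[df['value'].ne(prev)].reset_index(drop=True)
print(result.to_string())
```

On the sample data this keeps 8 rows, dropping only the 2021-08-25 IL row whose value (100) repeats the previous IL value.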