Python: Find all differences between groups in a Polars DataFrame

Posted on April 16

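I have a Polars DataFrame, and I am trying to find the differences (fields whose values have changed) across multiple columns between groups on a key. The frame can contain multiple groups and multiple columns. The groups are essentially datetimes stored as ints (yyyyMMdd).

How can I find the rows where a new value appears in (any) column on a given date?

Sample data: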

raw_df = pl.DataFrame([
    {'id': 'AAPL','update_time': 20241112,'status':'trading', 'underlying': 'y'},
    {'id': 'MSFT','update_time': 20241113,'status': 'trading', 'underlying': 'x'},
    {'id': 'NVDA','update_time': 20241112,'status': 'trading', 'underlying': 'z'},
    {'id': 'MSFT','update_time': 20241112,'status': 'pending','underlying': 'x'},
    {'id': 'AAPL','update_time': 20241113,'status': 'trading', 'underlying': 'y'},
    {'id': 'NVDA','update_time': 20241113,'status': 'trading', 'underlying': 'z'},
    {'id': 'TSLA','update_time': 20241112,'status': 'closed', 'underlying': 'v'},
    ]
)

expected_df = pl.DataFrame([
    {'id': 'MSFT','update_time': 20241112,'status':'pending', 'underlying': 'x'},
    {'id': 'MSFT','update_time': 20241113,'status': 'trading', 'underlying': 'x'},
    ]
)

This is what the input data looks like.

shape: (7, 4)
┌──────┬─────────────┬─────────┬────────────┐
│ id   ┆ update_time ┆ status  ┆ underlying │
│ ---  ┆ ---         ┆ ---     ┆ ---        │
│ str  ┆ i64         ┆ str     ┆ str        │
╞══════╪═════════════╪═════════╪════════════╡
│ AAPL ┆ 20241112    ┆ trading ┆ y          │
│ MSFT ┆ 20241113    ┆ trading ┆ x          │
│ NVDA ┆ 20241112    ┆ trading ┆ z          │
│ MSFT ┆ 20241112    ┆ pending ┆ x          │
│ AAPL ┆ 20241113    ┆ trading ┆ y          │
│ NVDA ┆ 20241113    ┆ trading ┆ z          │
│ TSLA ┆ 20241112    ┆ closed  ┆ v          │
└──────┴─────────────┴─────────┴────────────┘

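Below is the expected result, showing the id, the update time, and the field that changed. If more than one field was changed/updated, each should ideally go into its own row.

I am trying to find the changed fields, grouped by "update_time" on the key "id". One caveat is that an id may exist in one group but not in another, e.g. "TSLA". Ids like this, which are not common to the groups (not in their intersection), can be ignored. Since only the status of MSFT changed, the result should be filtered down to the two rows where it was updated. Changes should be checked on all other columns; update_time is only the column we group by.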

shape: (2, 3)
┌──────┬─────────────┬─────────┐
│ id   ┆ update_time ┆ status  │
│ ---  ┆ ---         ┆ ---     │
│ str  ┆ i64         ┆ str     │
╞══════╪═════════════╪═════════╡
│ MSFT ┆ 20241112    ┆ pending │
│ MSFT ┆ 20241113    ┆ trading │
└──────┴─────────────┴─────────┘

I am not sure how to do this, but this is the closest I have, and it does not handle the caveat mentioned above.

def find_updated_field_differences(df):
        columns_to_check = [col for col in df.columns if col != 'id' and col != 'update_time']

        sorted_df = df.sort('update_time')
        grouped_df = sorted_df.groupby(["update_time"])
        
        result_data = []
        
        for group_key, group_df in grouped_df:
            print(group_df)
    
            for col in columns_to_check:

                group_df = group_df.with_columns(
                    (pl.col(col) != pl.col(col).shift()).alias(f"{col}_changed")
                )
            
            differing_rows = group_df.filter(
                pl.any([pl.col(f"{col}_changed") for col in columns_to_check])
            )


            result_data.append(differing_rows)
        
        differing_df = pl.concat(result_data)
        
        differing_df = differing_df.sort("id")
        
        return differing_df

Recommended answer

import polars as pl

def find_updated_field_differences(df):
    sorted_df = df.sort(['id', 'update_time'])
    shifted_df = sorted_df.shift(-1)
    mask = (
        (sorted_df['id'] == shifted_df['id'])
        & (
            (sorted_df['status'] != shifted_df['status'])
            | (sorted_df['underlying'] != shifted_df['underlying'])
        )
    )
    start_changes = sorted_df.filter(mask)
    end_changes = sorted_df.shift(-1).filter(mask)
    differing_df = pl.concat([start_changes, end_changes]).unique()
    return differing_df.sort(['id', 'update_time'])

raw_df = pl.DataFrame([
    {'id': 'AAPL', 'update_time': 20241112, 'status': 'trading', 'underlying': 'y'},
    {'id': 'MSFT', 'update_time': 20241113, 'status': 'trading', 'underlying': 'x'},
    {'id': 'NVDA', 'update_time': 20241112, 'status': 'trading', 'underlying': 'z'},
    {'id': 'MSFT', 'update_time': 20241112, 'status': 'pending', 'underlying': 'x'},
    {'id': 'AAPL', 'update_time': 20241113, 'status': 'trading', 'underlying': 'y'},
    {'id': 'NVDA', 'update_time': 20241113, 'status': 'trading', 'underlying': 'z'},
    {'id': 'TSLA', 'update_time': 20241112, 'status': 'closed', 'underlying': 'v'},
])

result_df = find_updated_field_differences(raw_df)
print(result_df)

shape: (2, 4)
┌──────┬─────────────┬─────────┬────────────┐
│ id   ┆ update_time ┆ status  ┆ underlying │
│ ---  ┆ ---         ┆ ---     ┆ ---        │
│ str  ┆ i64         ┆ str     ┆ str        │
╞══════╪═════════════╪═════════╪════════════╡
│ MSFT ┆ 20241112    ┆ pending ┆ x          │
│ MSFT ┆ 20241113    ┆ trading ┆ x          │
└──────┴─────────────┴─────────┴────────────┘
