Python: row-wise aggregation based on non-zero columns

Posted on September 7

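My scenario is that I have a sparse dataset: for each row, only 4 columns are populated at a time, and the rest are zero. Each variable prefix has 3 variables (for example a_qty, a_height, a_width). The number of columns is completely arbitrary, because new prefixes may be added at any time.

As an example, the dataframe might look like this: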

import polars as pl

df = pl.DataFrame(
    {
        "some_id": ["x", "y", "z"],
        "a_qty": [1, 0, 0],
        "a_height": [2, 0, 0],
        "a_width": [3, 0, 0],
        "b_qty": [0, 3, 0],
        "b_height": [0, 5, 0],
        "b_width": [0, 8, 0],
        "c_qty": [0, 0, 10],
        "c_height": [0, 0, 11],
        "c_width": [0, 0, 12],
    }
)
# prints
┌─────────┬───────┬──────────┬─────────┬───┬─────────┬───────┬──────────┬─────────┐
│ some_id ┆ a_qty ┆ a_height ┆ a_width ┆ … ┆ b_width ┆ c_qty ┆ c_height ┆ c_width │
│ ---     ┆ ---   ┆ ---      ┆ ---     ┆   ┆ ---     ┆ ---   ┆ ---      ┆ ---     │
│ str     ┆ i64   ┆ i64      ┆ i64     ┆   ┆ i64     ┆ i64   ┆ i64      ┆ i64     │
╞═════════╪═══════╪══════════╪═════════╪═══╪═════════╪═══════╪══════════╪═════════╡
│ x       ┆ 1     ┆ 2        ┆ 3       ┆ … ┆ 0       ┆ 0     ┆ 0        ┆ 0       │
│ y       ┆ 0     ┆ 0        ┆ 0       ┆ … ┆ 8       ┆ 0     ┆ 0        ┆ 0       │
│ z       ┆ 0     ┆ 0        ┆ 0       ┆ … ┆ 0       ┆ 10    ┆ 11       ┆ 12      │
└─────────┴───────┴──────────┴─────────┴───┴─────────┴───────┴──────────┴─────────┘

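I am looking for a way to narrow the dataset down by reducing the number of columns and removing the zeros for a given row. Each row of the target dataframe should contain:

qty: the sum of the columns containing qty in the given row (likewise for width and height)
is_prod_*: a dummy variable set to 1 according to the prefix of the non-zero columns. Prefixes cannot be mixed (only one letter per row)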
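
To make the per-row rule above concrete, here is a minimal plain-Python sketch that works on one row represented as a dict. The function name `collapse_row` and the explicit `prefixes` argument are hypothetical and used only for illustration; the sketch assumes every column other than some_id is named `<prefix>_<measure>`.

```python
def collapse_row(row: dict, prefixes: list) -> dict:
    """Collapse one sparse row into summed measures plus is_prod_* flags."""
    out = {"some_id": row["some_id"], "qty": 0, "height": 0, "width": 0}
    # Start with every dummy flag at 0.
    out.update({f"is_prod_{p}": 0 for p in prefixes})
    for name, value in row.items():
        if name == "some_id" or value == 0:
            continue  # skip the id column and the zero padding
        prefix, measure = name.split("_", 1)
        out[measure] += value           # sum per measure (qty/height/width)
        out[f"is_prod_{prefix}"] = 1    # flag the prefix that carried data
    return out


row_y = {
    "some_id": "y",
    "a_qty": 0, "a_height": 0, "a_width": 0,
    "b_qty": 3, "b_height": 5, "b_width": 8,
    "c_qty": 0, "c_height": 0, "c_width": 0,
}
print(collapse_row(row_y, ["a", "b", "c"]))
```

This is only a reference for the intended semantics; it loops in Python and is not meant as an efficient way to process a whole dataframe.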

df_target = pl.DataFrame(
    {
        "some_id": ["x", "y", "z"],
        "height": [2, 5, 11],
        "width": [3, 8, 12],
        "qty": [1, 3, 10],
        "is_prod_a": [1, 0, 0],
        "is_prod_b": [0, 1, 0],
        "is_prod_c": [0, 0, 1]
    }
)
# prints
┌─────────┬────────┬───────┬─────┬───────────┬───────────┬───────────┐
│ some_id ┆ height ┆ width ┆ qty ┆ is_prod_a ┆ is_prod_b ┆ is_prod_c │
│ ---     ┆ ---    ┆ ---   ┆ --- ┆ ---       ┆ ---       ┆ ---       │
│ str     ┆ i64    ┆ i64   ┆ i64 ┆ i64       ┆ i64       ┆ i64       │
╞═════════╪════════╪═══════╪═════╪═══════════╪═══════════╪═══════════╡
│ x       ┆ 2      ┆ 3     ┆ 1   ┆ 1         ┆ 0         ┆ 0         │
│ y       ┆ 5      ┆ 8     ┆ 3   ┆ 0         ┆ 1         ┆ 0         │
│ z       ┆ 11     ┆ 12    ┆ 10  ┆ 0         ┆ 0         ┆ 1         │
└─────────┴────────┴───────┴─────┴───────────┴───────────┴───────────┘

I tried grouping by row number and summing the variables by type:

width_cols = [col for col in df.columns if "width" in col]
height_cols = [col for col in df.columns if "height" in col]
qty_cols = [col for col in df.columns if "qty" in col]

(
    df
    .with_row_count()
    .group_by("row_nr")
    .agg([
        pl.col(*width_cols).sum().alias('width')
    ])
)

# DuplicateError: column with name 'width' has more than one occurrences

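This obviously does not work, because I am trying to create duplicate columns: every selected width column is given the same alias. What is the most elegant way to implement this kind of aggregation?

Edit 1

Using pl.fold gives the row-wise sums of the columns, which seems to do the trick. I am still not sure how to create the dummy columns whose names depend on a condition.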

(
    df
    .with_columns([
        pl.fold(0, lambda acc, s: acc + s, pl.col(*width_cols)).alias("width"),
        pl.fold(0, lambda acc, s: acc + s, pl.col(*height_cols)).alias("height"),
        pl.fold(0, lambda acc, s: acc + s, pl.col(*qty_cols)).alias("qty")
    ])
)

Recommended answer

shape: (27, 3)
┌─────────┬──────────┬───────┐
│ some_id ┆ variable ┆ value │
│ ---     ┆ ---      ┆ ---   │
│ str     ┆ str      ┆ i64   │
╞═════════╪══════════╪═══════╡
│ x       ┆ a_qty    ┆ 1     │
│ y       ┆ a_qty    ┆ 0     │
│ z       ┆ a_qty    ┆ 0     │
│ x       ┆ a_height ┆ 2     │
│ …       ┆ …        ┆ …     │
│ z       ┆ c_height ┆ 11    │
│ x       ┆ c_width  ┆ 0     │
│ y       ┆ c_width  ┆ 0     │
│ z       ┆ c_width  ┆ 12    │
└─────────┴──────────┴───────┘

df_out = (
    df
    .melt('some_id')
    .filter(pl.col('value') > 0)
    .select(
        pl.exclude('variable'),
        pl.col('variable').str.extract(r'_(\w+)$'),
        pl.col('variable').str.extract(r'^(\w)_').alias('prefix'),
    )
    .pivot(index=['some_id', 'prefix'], columns='variable', values='value')
)

shape: (3, 5)
┌─────────┬────────┬─────┬────────┬───────┐
│ some_id ┆ prefix ┆ qty ┆ height ┆ width │
│ ---     ┆ ---    ┆ --- ┆ ---    ┆ ---   │
│ str     ┆ str    ┆ i64 ┆ i64    ┆ i64   │
╞═════════╪════════╪═════╪════════╪═══════╡
│ x       ┆ a      ┆ 1   ┆ 2      ┆ 3     │
│ y       ┆ b      ┆ 3   ┆ 5      ┆ 8     │
│ z       ┆ c      ┆ 10  ┆ 11     ┆ 12    │
└─────────┴────────┴─────┴────────┴───────┘

shape: (3, 7)
┌─────────┬─────┬────────┬───────┬───────────┬───────────┬───────────┐
│ some_id ┆ qty ┆ height ┆ width ┆ is_prod_a ┆ is_prod_b ┆ is_prod_c │
│ ---     ┆ --- ┆ ---    ┆ ---   ┆ ---       ┆ ---       ┆ ---       │
│ str     ┆ i64 ┆ i64    ┆ i64   ┆ i64       ┆ i64       ┆ i64       │
╞═════════╪═════╪════════╪═══════╪═══════════╪═══════════╪═══════════╡
│ x       ┆ 1   ┆ 2      ┆ 3     ┆ 1         ┆ 0         ┆ 0         │
│ y       ┆ 3   ┆ 5      ┆ 8     ┆ 0         ┆ 1         ┆ 0         │
│ z       ┆ 10  ┆ 11     ┆ 12    ┆ 0         ┆ 0         ┆ 1         │
└─────────┴─────┴────────┴───────┴───────────┴───────────┴───────────┘
