Filtering and aggregating a dict of lists in Python Polars

Posted on May 10

I have a DataFrame whose single column holds JSON strings:

import polars as pl

df = pl.DataFrame({
    "json": [
        '{"x":[0,1,2,3], "y":[10,20,30,40]}',
        '{"x":[0,1,2,3], "y":[10,20,30,40]}',
        '{"x":[0,1,2,3], "y":[10,20,30,40]}'
    ]
})

shape: (3, 1)
┌───────────────────────────────────┐
│ json                              │
│ ---                               │
│ str                               │
╞═══════════════════════════════════╡
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
└───────────────────────────────────┘

Now I want to compute, separately for each row, the average of the y values whose corresponding x satisfies x > 0 and x < 3.
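
To make the target concrete, here is a minimal plain-Python sketch of that computation using only the standard library. It is a reference for checking results, not a Polars solution, and the helper name `mean_y_where_x_between` and its `lo`/`hi` parameters are made up for illustration.

```python
import json


def mean_y_where_x_between(json_str, lo=0, hi=3):
    """Mean of the y values whose paired x satisfies lo < x < hi (strict bounds)."""
    record = json.loads(json_str)
    # Pair each x with its y and keep only the y values whose x is in range.
    selected = [y for x, y in zip(record["x"], record["y"]) if lo < x < hi]
    # An empty selection has no mean; return None instead of dividing by zero.
    return sum(selected) / len(selected) if selected else None


rows = [
    '{"x":[0,1,2,3], "y":[10,20,30,40]}',
    '{"x":[0,1,2,3], "y":[10,20,30,40]}',
    '{"x":[0,1,2,3], "y":[10,20,30,40]}',
]

results = [mean_y_where_x_between(s) for s in rows]
print(results)  # [25.0, 25.0, 25.0]
```

For each row only x = 1 and x = 2 pass the filter, so the mean is taken over y = 20 and y = 30.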

This is my current working solution:

First convert each string to a dict, then build a DataFrame from it and filter that by x.

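import ast

# For each row: parse the string into a dict, build a DataFrame, keep 0 < x < 3, take the mean of y.
df = df.with_columns([
    pl.col('json').apply(lambda x: pl.DataFrame(ast.literal_eval(x)).filter((pl.col('x') < 3) & (pl.col('x') > 0))['y'].mean())
])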

shape: (3, 1)
┌──────┐
│ json │
│ ---  │
│ f64  │
╞══════╡
│ 25.0 │
│ 25.0 │
│ 25.0 │
└──────┘

This works fine, but on large datasets the apply call slows things down considerably.

Is there a more elegant and faster way to do this?

>>> df.with_columns(pl.col("json").str.json_extract()).unnest("json")
shape: (3, 2)
┌─────────────┬────────────────┐
│ x           ┆ y              │
│ ---         ┆ ---            │
│ list[i64]   ┆ list[i64]      │
╞═════════════╪════════════════╡
│ [0, 1, … 3] ┆ [10, 20, … 40] │
│ [0, 1, … 3] ┆ [10, 20, … 40] │
│ [0, 1, … 3] ┆ [10, 20, … 40] │
└─────────────┴────────────────┘

(df.with_row_count()
   .with_columns(pl.col("json").str.json_extract())
   .unnest("json")
   .explode("x", "y")
   .filter(pl.col("x").is_between(1, 2))
   .groupby("row_nr")
   .agg(pl.mean("y")))

shape: (3, 2)
┌────────┬──────┐
│ row_nr ┆ y    │
│ ---    ┆ ---  │
│ u32    ┆ f64  │
╞════════╪══════╡
│ 0      ┆ 25.0 │
│ 1      ┆ 25.0 │
│ 2      ┆ 25.0 │
└────────┴──────┘

(df.with_columns(pl.col("json").str.json_extract())
   .unnest("json")
   .select(
       pl.col("y").arr.take(
           pl.col("x").arr.eval(pl.element().is_between(1, 2).arg_true())
       ).arr.mean()
   ))
