Python: count_matches with multiple options

Posted on April 25

I have a DataFrame like this:

import polars as pl

src = pl.DataFrame(
    {
        "c1": ["a", "b", "c", "d"],
        "c2": [[0], [2, 3, 4], [3, 4, 7, 9], [3, 9]],
    }
)

...and a list of targets:

targets = pl.Series([3, 7, 9])

...and I want to count how many elements of each list in "c2" are targets:

dst = pl.DataFrame(
    {
        "c1": ["a", "b", "c", "d"],
        "c2": [[0], [2, 3, 4], [3, 4, 7, 9], [3, 9]],
        "match_count": [0, 1, 3, 2],
    }
)

What is the most efficient way to do this?

I found count_matches, but it does not accept multiple options:

src["c2"].list.count_matches(3)          # OK.
src["c2"].list.count_matches([3, 7, 9])  # No way.

Recommended answer

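You can use pl.Expr.list.eval to check, for each element of the list column, whether it is contained in targets. The sum of the resulting boolean list then gives the number of matches.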

dst.with_columns(
    pl.col("c2").list.eval(pl.element().is_in(targets)).list.sum().alias("res")
)

shape: (4, 4)
┌─────┬─────────────┬─────────────┬─────┐
│ c1  ┆ c2          ┆ match_count ┆ res │
│ --- ┆ ---         ┆ ---         ┆ --- │
│ str ┆ list[i64]   ┆ i64         ┆ u32 │
╞═════╪═════════════╪═════════════╪═════╡
│ a   ┆ [0]         ┆ 0           ┆ 0   │
│ b   ┆ [2, 3, 4]   ┆ 1           ┆ 1   │
│ c   ┆ [3, 4, … 9] ┆ 3           ┆ 3   │
│ d   ┆ [3, 9]      ┆ 2           ┆ 2   │
└─────┴─────────────┴─────────────┴─────┘

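Similarly, you can use pl.Expr.list.eval to filter the lists in column c2 down to the target values, and then use pl.Expr.list.len to see how many values remain. If performance is critical, I recommend benchmarking both approaches.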

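Alternatively, if you are really interested in the number of elements in the set intersection (as @jqurious mentioned), or if the elements of at least one of the lists are unique, you can use pl.Expr.list.set_intersection.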

dst.with_columns(
    pl.col("c2")
    .list.set_intersection(targets.to_list())
    .list.len()
    .alias("res_intersection")
)

shape: (4, 4)
┌─────┬─────────────┬─────────────┬──────────────────┐
│ c1  ┆ c2          ┆ match_count ┆ res_intersection │
│ --- ┆ ---         ┆ ---         ┆ ---              │
│ str ┆ list[i64]   ┆ i64         ┆ u32              │
╞═════╪═════════════╪═════════════╪══════════════════╡
│ a   ┆ [0]         ┆ 0           ┆ 0                │
│ b   ┆ [2, 3, 4]   ┆ 1           ┆ 1                │
│ c   ┆ [3, 4, … 9] ┆ 3           ┆ 3                │
│ d   ┆ [3, 9]      ┆ 2           ┆ 2                │
└─────┴─────────────┴─────────────┴──────────────────┘