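Python: How to unpack a string into multiple columns in a Polars DataFrame using expressions

Posted on April 10

I have a Polars DataFrame with a column whose strings represent "sparse" sector exposures, like this: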

import polars as pl

df = pl.DataFrame(
    [
        pl.Series("sector_exposure", [
            'Technology=0.207;Financials=0.090;Health Care=0.084;Consumer Discretionary=0.069',
            'Financials=0.250;Health Care=0.200;Consumer Staples=0.150;Industrials=0.400',
        ], dtype=pl.String),
    ]
)

sector_exposure
Technology=0.207;Financials=0.090;Health Care=0.084;Consumer Discretionary=0.069
Financials=0.250;Health Care=0.200;Consumer Staples=0.150;Industrials=0.400

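I would like to "unpack" this string either into one new column per sector (e.g. Technology, Financials, Health Care) holding the associated value, or into a Polars struct with the sector names as fields and the exposure values as values.

I am looking for a more efficient solution that uses only Polars expressions, without resorting to Python loops (or Python mapping functions). Can anyone advise how to do this?

This is what I have come up with so far. It produces the desired struct, but it is a bit slow.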

(
    df["sector_exposure"]
    .str
    .split(";")
    .map_elements(lambda x: {entry.split('=')[0]: float(entry.split('=')[1]) for entry in x},
                  skip_nulls=True,
                  )
)

Thanks!

Regex extract

df.with_columns(
    pl.col('sector_exposure').str.extract(x + r"=(\d+\.\d+)").cast(pl.Float64).alias(x)
    for x in ["Technology", "Financials", "Health Care",
              "Consumer Discretionary", "Consumer Staples", "Industrials"]
)

shape: (2, 7)
┌────────────────┬────────────┬────────────┬─────────────┬────────────────┬──────────┬─────────────┐
│ sector_exposur ┆ Technology ┆ Financials ┆ Health Care ┆ Consumer       ┆ Consumer ┆ Industrials │
│ e              ┆ ---        ┆ ---        ┆ ---         ┆ Discretionary  ┆ Staples  ┆ ---         │
│ ---            ┆ f64        ┆ f64        ┆ f64         ┆ ---            ┆ ---      ┆ f64         │
│ str            ┆            ┆            ┆             ┆ f64            ┆ f64      ┆             │
╞════════════════╪════════════╪════════════╪═════════════╪════════════════╪══════════╪═════════════╡
│ Technology=0.2 ┆ 0.207      ┆ 0.09       ┆ 0.084       ┆ 0.069          ┆ null     ┆ null        │
│ 07;Financials= ┆            ┆            ┆             ┆                ┆          ┆             │
│ 0.090;Health   ┆            ┆            ┆             ┆                ┆          ┆             │
│ Care=0.084;Con ┆            ┆            ┆             ┆                ┆          ┆             │
│ sumer Discreti ┆            ┆            ┆             ┆                ┆          ┆             │
│ onary=0.069    ┆            ┆            ┆             ┆                ┆          ┆             │
│ Financials=0.2 ┆ null       ┆ 0.25       ┆ 0.2         ┆ null           ┆ 0.15     ┆ 0.4         │
│ 50;Health Care ┆            ┆            ┆             ┆                ┆          ┆             │
│ =0.200;Consume ┆            ┆            ┆             ┆                ┆          ┆             │
│ r Staples=0.15 ┆            ┆            ┆             ┆                ┆          ┆             │
│ 0;Industrials= ┆            ┆            ┆             ┆                ┆          ┆             │
│ 0.400          ┆            ┆            ┆             ┆                ┆          ┆             │
└────────────────┴────────────┴────────────┴─────────────┴────────────────┴──────────┴─────────────┘

This approach relies on two things: every number must be written as a decimal (you can tweak the regex to relax that), and every sector must be listed up front in the generator passed to with_columns.

Split and pivot
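One way to relax the decimal-only requirement is to make the fractional part optional. The sketch below checks such a pattern with Python's `re` module on a made-up row. The helper name `sector_pattern` is hypothetical, and the loop is there only to exercise the pattern; in Polars the same string would be passed to `str.extract`. It assumes sector names contain no regex metacharacters and that this simple syntax behaves the same in the regex engine Polars uses.

```python
import re


def sector_pattern(name: str) -> str:
    # Hypothetical helper (not part of Polars).
    # (?:^|;) anchors the name at the start of the string or right after ";",
    # so "Financials" cannot match inside a longer sector name.
    # (?:\.\d+)? makes the fractional part optional, so "1" and "0.25" both match.
    return r"(?:^|;)" + name + r"=(\d+(?:\.\d+)?)"


# Made-up row mixing an integer and decimals.
row = "Technology=1;Financials=0.25;Health Care=0.084"

values = {}
for name in ["Technology", "Financials", "Health Care", "Industrials"]:
    match = re.search(sector_pattern(name), row)
    # Capture group 1 holds the number; a missing sector becomes None.
    values[name] = float(match.group(1)) if match else None

print(values)
```

Because the leading group is non-capturing, the number is still capture group 1, which is the group `str.extract` returns by default.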

(
    df
    .with_columns(str_split=pl.col('sector_exposure').str.split(';'))
    .explode('str_split')
    .with_columns(
        pl.col('str_split')
        .str.split('=')
        .list.to_struct(fields=['sector', 'value'])
    )
    .unnest('str_split')
    .pivot(values='value', index='sector_exposure', columns='sector', aggregate_function='first')
    .with_columns(pl.exclude('sector_exposure').cast(pl.Float64))
)

shape: (2, 7)
┌────────────────┬────────────┬────────────┬─────────────┬────────────────┬──────────┬─────────────┐
│ sector_exposur ┆ Technology ┆ Financials ┆ Health Care ┆ Consumer       ┆ Consumer ┆ Industrials │
│ e              ┆ ---        ┆ ---        ┆ ---         ┆ Discretionary  ┆ Staples  ┆ ---         │
│ ---            ┆ f64        ┆ f64        ┆ f64         ┆ ---            ┆ ---      ┆ f64         │
│ str            ┆            ┆            ┆             ┆ f64            ┆ f64      ┆             │
╞════════════════╪════════════╪════════════╪═════════════╪════════════════╪══════════╪═════════════╡
│ Technology=0.2 ┆ 0.207      ┆ 0.09       ┆ 0.084       ┆ 0.069          ┆ null     ┆ null        │
│ 07;Financials= ┆            ┆            ┆             ┆                ┆          ┆             │
│ 0.090;Health   ┆            ┆            ┆             ┆                ┆          ┆             │
│ Care=0.084;Con ┆            ┆            ┆             ┆                ┆          ┆             │
│ sumer Discreti ┆            ┆            ┆             ┆                ┆          ┆             │
│ onary=0.069    ┆            ┆            ┆             ┆                ┆          ┆             │
│ Financials=0.2 ┆ null       ┆ 0.25       ┆ 0.2         ┆ null           ┆ 0.15     ┆ 0.4         │
│ 50;Health Care ┆            ┆            ┆             ┆                ┆          ┆             │
│ =0.200;Consume ┆            ┆            ┆             ┆                ┆          ┆             │
│ r Staples=0.15 ┆            ┆            ┆             ┆                ┆          ┆             │
│ 0;Industrials= ┆            ┆            ┆             ┆                ┆          ┆             │
│ 0.400          ┆            ┆            ┆             ┆                ┆          ┆             │
└────────────────┴────────────┴────────────┴─────────────┴────────────────┴──────────┴─────────────┘

In this one, you do a first round of splitting on the semicolon and then explode. Then you split again on the equals sign, turn the result into a struct, and unnest it. From there you pivot the sectors up into columns.

If the sectors always appeared in the same order, you could use str.extract_groups, but with the order varying between rows I don't think it works.

