Python: splitting pipe-delimited values in multiple columns into rows

Posted on August 15

I have the following Spark source DataFrame:

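from pyspark.sql import SparkSession

# Create (or reuse) a session; in the pyspark shell `spark` already exists
spark = SparkSession.builder.getOrCreate()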

arrayData = [
        ('1','temperature','21', 'Celsius'),
        ('2','humidity|temperature','88|21.8', 'Percent|Celsius'),
        ('3','temperature','21.2', 'Celsius'),
        ('4','temperature','19.9', 'Celsius'),
        ('5','humidity','85.5', 'Percent')]
df = spark.createDataFrame(data=arrayData, schema = ['id','types','values','labels'])
df.show()

+----+--------------------+---------+---------------+
|  id|               types|   values|         labels|
+----+--------------------+---------+---------------+
|   1|         temperature|       21|        Celsius|
|   2|humidity|temperature|  88|21.8|Percent|Celsius|
|   3|         temperature|     21.2|        Celsius|
|   4|         temperature|     19.9|        Celsius|
|   5|            humidity|     85.5|        Percent|
+----+--------------------+---------+---------------+

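I would like to split the types, values, and labels so that each value ends up on its own row next to its matching type and label, like this: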

+----+--------------------+---------+---------------+
|  id|                type|    value|          label|
+----+--------------------+---------+---------------+
|   1|         temperature|       21|        Celsius|
|   2|         temperature|     21.8|        Celsius|
|   2|            humidity|       88|        Percent|
|   3|         temperature|     21.2|        Celsius|
|   4|         temperature|     19.9|        Celsius|
|   5|            humidity|     85.5|        Percent|
+----+--------------------+---------+---------------+

I tried using the split and explode functions, but that creates a new row for every combination of values:

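from pyspark.sql.functions import explode, col, split

# Each explode emits one row per array element, so chaining three of them
# multiplies the rows instead of pairing the elements up
df_2 = df.withColumn("id", col("id"))\
         .withColumn("type", explode(split("types", r"\|")))\
         .withColumn("value", explode(split("values", r"\|")))\
         .withColumn("label", explode(split("labels", r"\|")))
df_2.show()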

+----+--------------------+---------+---------------+
|  id|                type|    value|          label|
+----+--------------------+---------+---------------+
|   1|         temperature|       21|        Celsius|
|   2|         temperature|     21.8|        Celsius|
|   2|         temperature|     21.8|        Percent|
|   2|         temperature|       88|        Celsius|
|   2|         temperature|       88|        Percent|
|   2|            humidity|     21.8|        Celsius|
|   2|            humidity|     21.8|        Percent|
|   2|            humidity|       88|        Celsius|
|   2|            humidity|       88|        Percent|
|   3|         temperature|     21.2|        Celsius|
|   4|         temperature|     19.9|        Celsius|
|   5|            humidity|     85.5|        Percent|
+----+--------------------+---------+---------------+
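
The eight rows for id 2 follow from how explode behaves: each call emits one row per array element, so three chained calls on two-element arrays give 2 × 2 × 2 = 8 combinations. A minimal plain-Python sketch of the same effect (no Spark involved; the variable names are only illustrative):

```python
from itertools import product

# The row with id 2 from the question, already split on "|"
types = "humidity|temperature".split("|")
values = "88|21.8".split("|")
labels = "Percent|Celsius".split("|")

# Chained explode calls behave like a Cartesian product of the three lists
combos = list(product(types, values, labels))
print(len(combos))  # 8
```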

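Use arrays_zip to pair the split arrays position by position, then explode the zipped array once:

import pyspark.sql.functions as F

cols = ['types', 'values', 'labels']
# Split each column on "|" and zip the resulting arrays element-wise
arr = F.arrays_zip(*[F.split(c, r'\|').alias(c) for c in cols])
# A single explode over the zipped array yields one row per position
df1 = df.withColumn('arr', F.explode(arr))
# Pull the struct fields back out into top-level columns
df1 = df1.select('id', *[F.col('arr')[c].alias(c) for c in cols])

df1.show()
+---+-----------+------+-------+
| id|      types|values| labels|
+---+-----------+------+-------+
|  1|temperature|    21|Celsius|
|  2|   humidity|    88|Percent|
|  2|temperature|  21.8|Celsius|
|  3|temperature|  21.2|Celsius|
|  4|temperature|  19.9|Celsius|
|  5|   humidity|  85.5|Percent|
+---+-----------+------+-------+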
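
The positional pairing done by arrays_zip corresponds roughly to Python's built-in zip. As an illustration outside Spark, assuming every column of a row holds the same number of pipe-separated items (the helper name split_rows is hypothetical, not part of any library):

```python
def split_rows(rows):
    """Expand each (id, types, values, labels) tuple into one tuple per position."""
    out = []
    for row_id, types, values, labels in rows:
        # zip pairs the i-th type with the i-th value and the i-th label,
        # rather than combining every item with every other item
        for t, v, l in zip(types.split("|"), values.split("|"), labels.split("|")):
            out.append((row_id, t, v, l))
    return out


data = [
    ("1", "temperature", "21", "Celsius"),
    ("2", "humidity|temperature", "88|21.8", "Percent|Celsius"),
]
result = split_rows(data)
for r in result:
    print(r)
```

One difference worth keeping in mind: Python's zip stops at the shortest list, whereas arrays_zip pads shorter arrays with null, so rows with mismatched item counts would behave differently in the two versions.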
