I have the following Spark source DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

arrayData = [
    ('1', 'temperature', '21', 'Celsius'),
    ('2', 'humidity|temperature', '88|21.8', 'Percent|Celsius'),
    ('3', 'temperature', '21.2', 'Celsius'),
    ('4', 'temperature', '19.9', 'Celsius'),
    ('5', 'humidity', '85.5', 'Percent'),
]
df = spark.createDataFrame(data=arrayData, schema=['id', 'types', 'values', 'labels'])
df.show()
+----+--------------------+---------+---------------+
| id| types| values| labels|
+----+--------------------+---------+---------------+
| 1| temperature| 21| Celsius|
| 2|humidity|temperature| 88|21.8|Percent|Celsius|
| 3| temperature| 21.2| Celsius|
| 4| temperature| 19.9| Celsius|
| 5| humidity| 85.5| Percent|
+----+--------------------+---------+---------------+
I want to split the types, values and labels apart so that each value ends up on its own row with its matching type and label, like this:
+----+--------------------+---------+---------------+
| id| type| value| label|
+----+--------------------+---------+---------------+
| 1| temperature| 21| Celsius|
| 2| temperature| 21.8| Celsius|
| 2| humidity| 88| Percent|
| 3| temperature| 21.2| Celsius|
| 4| temperature| 19.9| Celsius|
| 5| humidity| 85.5| Percent|
+----+--------------------+---------+---------------+
I tried using the split and explode functions, but that produces a new row for every combination of the values:
from pyspark.sql.functions import explode, split

df_2 = df \
    .withColumn("type", explode(split("types", r"\|"))) \
    .withColumn("value", explode(split("values", r"\|"))) \
    .withColumn("label", explode(split("labels", r"\|")))
df_2.show()
+----+--------------------+---------+---------------+
| id| type| value| label|
+----+--------------------+---------+---------------+
| 1| temperature| 21| Celsius|
| 2| temperature| 21.8| Celsius|
| 2| temperature| 21.8| Percent|
| 2| temperature| 88| Celsius|
| 2| temperature| 88| Percent|
| 2| humidity| 21.8| Celsius|
| 2| humidity| 21.8| Percent|
| 2| humidity| 88| Celsius|
| 2| humidity| 88| Percent|
| 3| temperature| 21.2| Celsius|
| 4| temperature| 19.9| Celsius|
| 5| humidity| 85.5| Percent|
+----+--------------------+---------+---------------+