I have a Spark DataFrame (in Palantir Foundry) with a column "c_Temperature". Each row of this column contains a JSON string with the following schema:
{"TempCelsiusEndAvg":"24.33","TempCelsiusEndMax":"null","TempCelsiusEndMin":"null","TempCelsiusStartAvg":"22.54","TempCelsiusStartMax":"null","TempCelsiusStartMin":"null","TempEndPlausibility":"T_PLAUSIBLE","TempStartPlausibility":"T_PLAUSIBLE"}
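I tried to extract the average temperature values (sometimes "null", sometimes a value like "24.33") into the new columns "TempCelsiusEndAvg" and "TempCelsiusStartAvg" with the following code: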
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def flat_json(sessions_finished):
    df = sessions_finished
    # Cast the JSON column to a plain string column
    df = df.withColumn("new_temperature", F.col('c_temperature').cast(StringType()))
    # Pull the two average temperatures out of the JSON string by path
    df = df.withColumn("TempCelsiusEndAvg", F.get_json_object("c_Temperature", '$.TempCelsiusEndAvg'))
    df = df.withColumn("TempCelsiusStartAvg", F.get_json_object("c_Temperature", '$.TempCelsiusStartAvg'))
    return df
I would like the new columns to be filled with doubles, like this:
... +-----------------+-------------------+ ...
... |TempCelsiusEndAvg|TempCelsiusStartAvg| ...
... +-----------------+-------------------+ ...
... | 24.33| 22.54| ...
... +-----------------+-------------------+ ...
... | 29.28| 25.16| ...
... +-----------------+-------------------+ ...
... | null| null| ...
... +-----------------+-------------------+ ...
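To make the expected values concrete, here is a minimal plain-Python sketch of the per-row conversion I am after. It is not Spark code and not a fix; `parse_avg_temps` is a hypothetical helper name, and it assumes that missing readings arrive as the string "null" (in any letter case), as in the sample row above.

```python
import json
from typing import Optional, Tuple


def parse_avg_temps(raw: Optional[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (TempCelsiusEndAvg, TempCelsiusStartAvg) as floats, or None where missing."""

    def to_float(value):
        # Assumption: missing readings are stored as the string "null".
        if value is None or str(value).strip().lower() == "null":
            return None
        try:
            return float(value)
        except (TypeError, ValueError):
            return None

    if raw is None:
        return (None, None)
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        # Not valid JSON: treat both values as missing.
        return (None, None)
    if not isinstance(record, dict):
        return (None, None)
    return (
        to_float(record.get("TempCelsiusEndAvg")),
        to_float(record.get("TempCelsiusStartAvg")),
    )


# Sample row copied from the schema shown above.
sample = (
    '{"TempCelsiusEndAvg":"24.33","TempCelsiusEndMax":"null",'
    '"TempCelsiusEndMin":"null","TempCelsiusStartAvg":"22.54",'
    '"TempCelsiusStartMax":"null","TempCelsiusStartMin":"null",'
    '"TempEndPlausibility":"T_PLAUSIBLE","TempStartPlausibility":"T_PLAUSIBLE"}'
)
print(parse_avg_temps(sample))  # (24.33, 22.54)
```

A row whose averages are "null" should come out as (None, None), which corresponds to the null row in the table above.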
The new DataFrame does contain the columns, but they are filled only with null values. Can anyone help me solve this problem?
... +-----------------+-------------------+ ...
... |TempCelsiusEndAvg|TempCelsiusStartAvg| ...
... +-----------------+-------------------+ ...
... | null| null| ...
... +-----------------+-------------------+ ...
... | null| null| ...
... +-----------------+-------------------+ ...
... | null| null| ...
... +-----------------+-------------------+ ...
There is also a comment in this thread: [https://stackoverflow.com/questions/46084158/how-can-you-parse-a-string-that-is-json-from-an-existing-temp-table-using-pyspar], which describes my problem, but I don't know how to apply that information.