Python: How to update two columns with different values under the same condition in PySpark

Posted on September 13

I have a DataFrame with a column a. I want to create two additional columns (b and c) based on column a. I can solve this by doing the same thing twice:

from pyspark.sql.functions import when

df = df.withColumn('b', when(df.a == 'something', 'x'))\
       .withColumn('c', when(df.a == 'something', 'y'))

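I would like to avoid repeating myself, because the condition that updates b and c is identical, and there are also many possible cases for column a. Is there a smarter solution to this problem? Can withColumn accept multiple columns?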

Recommended answer

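A struct is the best fit for this case: evaluate the condition once, return both values as the fields of a single struct column, then expand the struct into separate columns. See the example below.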

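from pyspark.sql import functions as func

# Assumes an active SparkSession named `spark`.
# Both values are packed into one struct, so the condition is written only once.
spark.sparkContext.parallelize([('something',), ('foobar',)]).toDF(['a']). \
    withColumn('b_c_struct',
               func.when(func.col('a') == 'something',
                         func.struct(func.lit('x').alias('b'), func.lit('y').alias('c'))
                         )
               ). \
    select('*', 'b_c_struct.*'). \
    show()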

# +---------+----------+----+----+
# |        a|b_c_struct|   b|   c|
# +---------+----------+----+----+
# |something|    {x, y}|   x|   y|
# |   foobar|      null|null|null|
# +---------+----------+----+----+

To remove the struct column and keep only the individual fields, chain drop('b_c_struct') after the select.