My dataframes look like this:
df1 = spark.createDataFrame([(1, "a"), (1, "b"), (1, "c")], ("col1", "col2"))
+----+----+
|col1|col2|
+----+----+
| 1| a|
| 1| b|
| 1| c|
+----+----+
df2 = spark.createDataFrame([(1, "k1"), (1, "k2"), (1, "k3"),(1,"k4")], ("col1", "col3"))
+----+----+
|col1|col3|
+----+----+
| 1| k1|
| 1| k2|
| 1| k3|
| 1| k4|
+----+----+
I want to produce
df3 = spark.createDataFrame([(1, "a", "k1"), (1, "b", "k2"), (1, "c", "k3"),(1, None, "k4")], ("col1", "col2", "col3"))
i.e. the desired output is:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| k1|
| 1| b| k2|
| 1| c| k3|
| 1|null| k4|
+----+----+----+
I tried df1.join(df2, on='col1', how="leftouter") and got:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| k4|
| 1| a| k3|
| 1| a| k2|
| 1| a| k1|
| 1| b| k4|
| 1| b| k3|
| 1| b| k2|
| 1| b| k1|
| 1| c| k4|
| 1| c| k3|
| 1| c| k2|
| 1| c| k1|
+----+----+----+
I found Merge rows from one dataframe that do not match specific columns in another dataframe Python Pandas, which is almost what I want. However, it uses Pandas DataFrames, and I'm not sure switching from Spark DataFrames to Pandas DataFrames is a good idea for this operation. Is there a native Spark way to do this?
Any help with a transformation that produces the desired output would be appreciated.