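I'm new here. I want to compare two DataFrames. If a value in one of the columns doesn't match, I want that column's name written to a new column. Following the "Compare two dataframes Pyspark" question, I was able to get that result. Now I want to filter the resulting DataFrame based on the newly created column.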

df1 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "UK"],
  [3, "GHI", 3000, "JPN"],
  [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "CAN"],
  [3, "GHI", 3500, "JPN"],
  [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])


from pyspark.sql.functions import *

# For every column except id, emit the column name when the two sides differ,
# and an empty string otherwise
conditions_ = [when(df1[c] != df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']

select_expr = [
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    # drop the empty strings so only the names of mismatched columns remain
    array_remove(array(*conditions_), "").alias("column_names")
]

# join on id and keep df2's values plus the list of mismatched columns
df3 = df1.join(df2, "id").select(*select_expr)
df3.show()

DF3:

+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  1|  ABC|5000|     US|          []|
|  2|  DEF|4000|    CAN|   [Address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+

This is the step where I get the error message:

df3.filter(df3.column_names!="")

Error: cannot resolve '(column_names = '')' due to data type mismatch: differing types in '(column_names = '')' (array<string> and string).

I want the following result:

DF3:

+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  2|  DEF|4000|    CAN|   [Address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+

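Recommended answer

You get the error because you are comparing an array column with a string. Convert the column_names array to a string first; after that, the comparison against "" in your filter will work: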

df3 = df3.withColumn('column_names',concat_ws(";",col("column_names")))
