Python PySpark: iterating over rows and dropping rows that contain a specified value

Posted on July 13

I have a DataFrame like this:

Column A	Column B
Hello	[{id: 1000, abbreviatedId: 1, name: "John", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "China", Capital: "Bejing"}, otherId: 400, language: "Cantonese", species: 23409, creature: "Human"}]
Bye	[{id: 2000, abbreviatedId: 2, name: "James", planet: "Earth", solarsystem: "Milky Way", universe: "this one", continent: {id: 33, country: "Russia", Capital: "Moscow"}, otherId: 500, language: "Russian", species: 12308, creature: "Human"}]

Before writing to an external location, how can I iterate over the DataFrame's rows and drop every row that contains country: "China"?

I tried:

if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
    df.write.format("delta").mode("overwrite").save("file://path/")

and

for row in df.rdd.collect():
    if df.select(array_contains(col("columnb.continent.country"), "China")) != True:
      df.drop(row)

df.write.format("delta").mode("overwrite").save("file://path/")

Recommended answer

from pyspark.sql.functions import expr
from pyspark.sql import Row

# Sample DataFrame with one column "b" holding an array of structs.
# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame([
    [
        [
            Row(**{"id": 1000, "abbreviatedId": 1, "name": "John",
                   "planet": "Earth", "solarsystem": "Milky Way",
                   "universe": "this one",
                   "continent": Row(**{"id": 33, "country": "China", "Capital": "Bejing"}),
                   "otherId": 400, "language": "Cantonese",
                   "species": 23409, "creature": "Human"}),
            Row(**{"id": 1001, "abbreviatedId": 2, "name": "Alex",
                   "planet": "Mars", "solarsystem": "Milky Way",
                   "universe": "this one",
                   "continent": Row(**{"id": 34, "country": "Japan", "Capital": "Tokyo"}),
                   "otherId": 400, "language": "Japanese",
                   "species": 23409, "creature": "Human"}),
        ]
    ]
], ["b"])

# Keep only the rows whose array has no element with continent.country == 'China'.
df.filter(expr("not exists(b, x -> x.continent.country == 'China')"))
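To make the row-level logic of the `not exists(b, x -> x.continent.country == 'China')` expression easier to follow, here is a minimal plain-Python sketch of the same predicate. It is only a model, not Spark code: the helper names `has_country` and `keep_row` and the trimmed-down sample rows are made up for illustration, and Spark's null handling is not reproduced.

```python
# Plain-Python model of: not exists(b, x -> x.continent.country == 'China')
# Each row stores a list of dicts under "b", mirroring the array-of-structs column.

def has_country(structs, country):
    """Return True if any element's continent.country equals `country`."""
    return any(s["continent"]["country"] == country for s in structs)


def keep_row(row, country="China"):
    """Keep a row only when none of its array elements match `country`."""
    return not has_country(row["b"], country)


# Hypothetical, shortened versions of the two rows from the question.
rows = [
    {"a": "Hello", "b": [{"id": 1000, "continent": {"id": 33, "country": "China"}}]},
    {"a": "Bye", "b": [{"id": 2000, "continent": {"id": 33, "country": "Russia"}}]},
]

# Build a new list instead of mutating `rows`, as filter() returns a new DataFrame.
kept = [row for row in rows if keep_row(row)]
print([row["a"] for row in kept])  # prints ['Bye']
```

In PySpark itself no Python loop is needed. `filter` returns a new DataFrame, so its result has to be captured and written, for example `df.filter(expr(...)).write.format("delta").mode("overwrite").save("file://path/")`. The attempts in the question cannot work as written: comparing the result of `df.select(...)` to `True` compares a DataFrame object rather than row values, and `DataFrame.drop` removes columns, not rows.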
