I have a Spark DataFrame like the one below (just an example; my real data has millions of rows):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({'ZIP1': ['50069', '50069', '50704', '50704', '52403', '52403'],
                   'ZIP2': ['50704', '52403', '50069', '52403', '50069', '50704'],
                   'STATE': ['IA', 'IA', 'IA', 'IA', 'IA', 'IA'],
                   'REGION': ['MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST', 'MIDWEST']})
sdf = spark.createDataFrame(df)
ZIP1 ZIP2 STATE REGION
0 50069 50704 IA MIDWEST
1 50069 52403 IA MIDWEST
2 50704 50069 IA MIDWEST
3 50704 52403 IA MIDWEST
4 52403 50069 IA MIDWEST
5 52403 50704 IA MIDWEST
If the two zip codes in the ZIP1 and ZIP2 columns form the same combination, one of the rows needs to be dropped. For example, rows 0 and 2 contain the same pair of zip codes, just in reverse order, so I need to remove row 0 or row 2. Similarly, remove row 1 or row 4, and so on.

Does anyone know how to achieve this in PySpark? I need a PySpark solution; if someone could provide both a PySpark and a plain Python solution, that would be even better. Thanks.
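For reference, one possible approach (a minimal sketch, not necessarily the only or best way): normalize each pair so that the smaller zip code always comes first, then deduplicate on that order-insensitive key. least, greatest, and dropDuplicates are standard PySpark functions.

from pyspark.sql import functions as F

# Build an order-insensitive key: smaller zip code first, larger second,
# then keep only the first row seen for each key.
deduped = (sdf
    .withColumn('z_min', F.least('ZIP1', 'ZIP2'))
    .withColumn('z_max', F.greatest('ZIP1', 'ZIP2'))
    .dropDuplicates(['z_min', 'z_max'])
    .drop('z_min', 'z_max'))
deduped.show()

A plain pandas equivalent builds the same sorted key per row:

# Keep the first occurrence of each unordered (ZIP1, ZIP2) pair.
key = df[['ZIP1', 'ZIP2']].apply(lambda r: tuple(sorted(r)), axis=1)
df_deduped = df[~key.duplicated()]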