I have a DataFrame of about 10,000 rows, roughly 1,000 of which are either duplicates or near-duplicates.
Here is a simplified example:
import pandas as pd

df = pd.DataFrame({'App': ['Slack', 'Candy Bomb', 'Facebook', 'Candy Bomb', 'Slack', 'Slack', 'Facebook'],
                   'Category': ['Business', 'Game', 'Social', 'Family', 'Business', 'Business', 'Social'],
                   'Rating': [4.4, 4.6, 4.1, 4.6, 4.4, 4.4, 3.9],
                   'Reviews': [1000, 30000, 5000, 30000, 950, 950, 5000]})
Output:
App Category Rating Reviews
0 Slack Business 4.4 1000
1 Candy Bomb Game 4.6 30000
2 Facebook Social 4.1 5000
3 Candy Bomb Family 4.6 30000
4 Slack Business 4.4 950
5 Slack Business 4.4 950
6 Facebook Social 3.9 5000
For example, Slack's rows have different values in the Reviews column:
App Category Rating Reviews
0 Slack Business 4.4 1000
4 Slack Business 4.4 950
5 Slack Business 4.4 950
Expected output: Reviews
Candy Bomb's rows have different values in the Category column:
App Category Rating Reviews
1 Candy Bomb Game 4.6 30000
3 Candy Bomb Family 4.6 30000
Expected output: Category
How do I find the column with different values for each of the apps? (So that I can decide which rows to keep and which to delete.)
I tried this:
target_row = df[df['App'] == 'Candy Bomb']
columns = df.columns
for column in columns:
    # duplicated() marks repeated values as True: two equal entries
    # give [False, True] (iloc[0] != iloc[1]), while two distinct
    # entries give [False, False] (iloc[0] == iloc[1]).
    dupl_result = target_row.duplicated(subset=column)
    if dupl_result.iloc[0] == dupl_result.iloc[1]:
        print(column)
Output:
Category
However, the code above only works for apps that have exactly two near-duplicate rows, not for any other number of rows. I have tried modifying it in several ways, but it still does not work.
Is there a simpler or better way to solve this?
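For context, one direction I am considering is to group by App and count distinct values per column, so the per-app comparison no longer depends on the number of rows. This is only a sketch of that idea using groupby and nunique, not a confirmed solution:

```python
import pandas as pd

df = pd.DataFrame({'App': ['Slack', 'Candy Bomb', 'Facebook', 'Candy Bomb', 'Slack', 'Slack', 'Facebook'],
                   'Category': ['Business', 'Game', 'Social', 'Family', 'Business', 'Business', 'Social'],
                   'Rating': [4.4, 4.6, 4.1, 4.6, 4.4, 4.4, 3.9],
                   'Reviews': [1000, 30000, 5000, 30000, 950, 950, 5000]})

# Count distinct values per column within each App group.
nunique = df.groupby('App').nunique()

# A count > 1 means that column takes different values among the
# app's rows; collect those column names for each app.
differing = nunique.gt(1).apply(lambda row: row.index[row.values].tolist(), axis=1)
print(differing)
```

This works for any group size, since nunique counts distinct values rather than comparing a fixed pair of rows.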
Note: my question is different from How to Detect Almost Duplicate Locations in a Pandas Dataframe? and Detecting almost duplicate rows.
Update #01: emphasized the question to make it clearer.
Update #02: made the expected output clearer.