Problem个
很抱歉,所有帮助过我的人,但我不得不重新措辞这个问题.我有一个数据帧,其中包含除最后一列以外的大多数列的重复项.在有重复项的情况下,我想应用以下规则:
- 如果最后一列中的两个条目都有效,则两个都保留.
- 如果最后一列中的两个条目都为空,则保留一个.
- 如果一个条目有效,另一个条目为空,则保留有效条目.
然后,我想取出重复的值,并用它们创建一个单独的数据帧.目前,我的方法很费力,会删除两个都为空的重复项.
Reprex个
100
import pandas as pd
import numpy as np
data_input = {'Student': ['A', 'A', 'B', 'B', 'C', 'D', 'E', 'F', 'F', 'G', "H", "H", "I", "I"],
"Subject": ["Law", "Law", "Maths", "Maths", "Maths", "Law", "Maths", "Music", "Music", "Music", "Art", "Art", "Dance", "Dance"],
"Checked": ["Bob", "James", np.nan, "Jack", "Laura", "Laura", np.nan, np.nan, "Tim", "Tim", "Tim", np.nan, np.nan, np.nan]}
# Create DataFrame
df1 = pd.DataFrame(data_input)
100
100
attempt1 = df1.sort_values(['Student', 'Checked'], ascending=False).drop_duplicates(["Student", "Subject"]).sort_index()
这是我从Stack上的另一个Q&Amp;A中拿来的,但它没有给我我想要的结果,我也不明白.
100
#Create Duplicate column
df1["Duplicates"] = df1.duplicated(subset=["Student", "Subject"], keep=False)
#Create list of rows with no duplicates
df_new1 = df1[df1["Duplicates"]==False]
#Create list of rows with duplicates & remove all those with null values
#HERE IS WHERE I GET STUCK. IF BOTH DUPLICATES ARE NULLS, I WANT TO KEEP ONE OF THEM
df_new2 = df1[df1["Duplicates"]==True]
df_new3 = df_new2[~df_new2["Checked"].isnull()]
#Combine unique rows, and duplicates without null values
#Keep duplicates without null values
df_new = df_new1.append(df_new3)
#Tidy up
df_new = df_new[["Student", "Subject", "Checked"]].sort_values(by="Student")
df_new
100
#Create separate list of duplicates with valid "Checked" values
df_new["Duplicates"] = df_new.duplicated(subset="Student", keep=False)
conflicting_duplicates = df_new[df_new["Duplicates"]==True]
conflicting_duplicates
100
谢谢大家!您的回答有帮助,但我没有意识到我还想保留其中一个条目,其中两个条目都为空.
有没有更好的方法来做这件事?