I want to merge these sample DataFrames:
- How can I get the closest match into a new df?
df1:
name age department
DJ Griffin 27 FD
Harris Smith 33 RD
df2:
name age department
D.J. Griffin III 27 FD
Harris Smith 33 RD
Miles Jones 58 RD
The result should look like this:
df3:
name age department name_y
DJ Griffin 27 FD D.J. Griffin III
Harris Smith 33 RD Harris Smith
I tried difflib, but got an error, apparently because the DataFrames have different lengths.
import pandas as pd
import difflib
df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])
df2['name_y'] = df2['name']
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
Result:
IndexError: list index out of range
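For reference, a minimal sketch of where the error seems to come from: `difflib.get_close_matches` returns an empty list when nothing scores above its `cutoff` (0.6 by default), which is the case for "Miles Jones", so indexing `[0]` fails. The helper `closest_name` below is a hypothetical name used only for illustration; it returns `None` when there is no match, and the unmatched rows are then dropped.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame(
    [["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"]],
    columns=["name", "age", "department"],
)
df2 = pd.DataFrame(
    [["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Miles Jones", 58, "RD"]],
    columns=["name", "age", "department"],
)


def closest_name(name, candidates, cutoff=0.6):
    """Hypothetical helper: best fuzzy match, or None if nothing passes the cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


candidates = df1["name"].tolist()

df2["name_y"] = df2["name"]
df2["name"] = df2["name"].apply(lambda x: closest_name(x, candidates))

# Rows of df2 with no close match in df1 (here "Miles Jones") get None and are dropped.
df3 = (
    df2.dropna(subset=["name"])[["name", "age", "department", "name_y"]]
    .reset_index(drop=True)
)
print(df3)
```

Whether unmatched rows should be dropped or kept with an empty `name` depends on the use case; the default `cutoff` may also need tuning for other data.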
- If there were also a 45-year-old Harris Smith, how could I find the closest match using two columns?
For the duplicate Harris Smith case
df1:
name age department
DJ Griffin 27 FD
Harris Smith 33 RD
Harris Smith 45 BA
df2:
name age department
D.J. Griffin III 27 FD
Harris Smith 33 RD
Harris Smith 45 BA
Miles Jones 58 RD
The result should look like this:
df3:
name age department name_y
DJ Griffin 27 FD D.J. Griffin III
Harris Smith 33 RD Harris Smith
Harris Smith 45 BA Harris Smith
import pandas as pd
import difflib
df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])
df2['name_y'] = df2['name']
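One possible way to approach the two-column case, as a sketch only: it assumes `age` has to match exactly and only `name` is matched fuzzily, which may not be what is wanted if ages can differ slightly. The helper `match_row` is a hypothetical name introduced for illustration.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame(
    [["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"]],
    columns=["name", "age", "department"],
)
df2 = pd.DataFrame(
    [["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"],
     ["Harris Smith", 45, "BA"], ["Miles Jones", 58, "RD"]],
    columns=["name", "age", "department"],
)


def match_row(row, other, cutoff=0.6):
    """Hypothetical helper: fuzzy-match the name among rows of `other` with the same age."""
    # Assumption: age must be identical; only those rows are name candidates.
    same_age_names = other.loc[other["age"] == row["age"], "name"].tolist()
    matches = difflib.get_close_matches(row["name"], same_age_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


df3 = df1.copy()
df3["name_y"] = df3.apply(lambda row: match_row(row, df2), axis=1)

# Drop rows of df1 that found no counterpart in df2.
df3 = df3.dropna(subset=["name_y"]).reset_index(drop=True)
print(df3)
```

Filtering on age first keeps the two Harris Smith rows apart, since each one only sees candidates of its own age.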
Thanks for your help.