I want to merge these sample DataFrames:

  1. How can I get the closest match in a new df?
df1:

name           age      department
DJ Griffin     27       FD
Harris Smith   33       RD


df2:

name               age      department
D.J. Griffin III   27       FD
Harris Smith       33       RD
Miles Jones        58       RD

The result should look like this:

df3:

name           age      department   name_y
DJ Griffin     27       FD           D.J. Griffin III
Harris Smith   33       RD           Harris Smith

I tried difflib, but I get an error, apparently because the DataFrames have different lengths.

import pandas as pd
import difflib

df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])

df2['name_y'] = df2['name']

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])

Result:

IndexError: list index out of range

  2. If there is also a 45-year-old Harris Smith, how can I find the closest match using two columns?

For the duplicate Harris Smith case:

df1:

name           age      department
DJ Griffin     27       FD
Harris Smith   33       RD
Harris Smith   45       BA

df2:

name               age      department
D.J. Griffin III   27       FD
Harris Smith       33       RD
Harris Smith       45       BA
Miles Jones        58       RD

The result should look like this:

df3:

name           age      department   name_y
DJ Griffin     27       FD           D.J. Griffin III
Harris Smith   33       RD           Harris Smith
Harris Smith   45       BA           Harris Smith

import pandas as pd
import difflib

df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])

df2['name_y'] = df2['name']

Thanks for your help.

Recommended answer

The problem occurs when there are no matches: get_close_matches returns an empty list, so indexing it with [0] fails.
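To see this in isolation, here is a minimal sketch using the names from the question. The variable names `candidates` and `no_match` are only for illustration, and the default cutoff of 0.6 is assumed.

```python
import difflib

# Candidate names taken from df1 in the question.
candidates = ["DJ Griffin", "Harris Smith"]

# A name with a sufficiently similar candidate returns a non-empty list.
print(difflib.get_close_matches("D.J. Griffin III", candidates))

# A name with no candidate above the default cutoff (0.6) returns an empty list.
no_match = difflib.get_close_matches("Miles Jones", candidates)
print(no_match)

# Indexing the empty list is what raises the IndexError.
try:
    no_match[0]
except IndexError as exc:
    print("IndexError:", exc)
```

So the error comes from "Miles Jones" having no counterpart in df1, not from the two DataFrames having different lengths as such.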

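You can use either of the following, which return a missing value instead of raising when nothing matches: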

df2['name'].apply(lambda x: next(iter(difflib.get_close_matches(x, df1['name'])), pd.NA))

df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])).str[0]

Output:

0      DJ Griffin
1    Harris Smith
2             NaN
Name: name, dtype: object
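Putting this together for the first question, one possible sketch for building df3 is shown below. The helper `closest_name` and the intermediate frame `matched` are hypothetical names introduced here, not part of the original answer; rows of df2 without a match are dropped before merging.

```python
import difflib
import pandas as pd

df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"]],
                   columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"],
                    ["Miles Jones", 58, "RD"]],
                   columns=["name", "age", "department"])


def closest_name(name, candidates):
    """Hypothetical helper: best match from candidates, or pd.NA if none."""
    matches = difflib.get_close_matches(name, candidates, n=1)
    return matches[0] if matches else pd.NA


candidates = df1["name"].tolist()

# Keep the original df2 name as name_y and store the matched df1 name as name.
matched = pd.DataFrame({
    "name_y": df2["name"],
    "name": df2["name"].map(lambda x: closest_name(x, candidates)),
})

# Drop unmatched rows (Miles Jones), then join back onto df1 by the matched name.
df3 = df1.merge(matched.dropna(subset=["name"]), on="name")
print(df3)
```

This only matches on name, so it is not enough once the same name appears with different ages.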

Update, for the second question (matching on both name and age):

df1.merge(df2[['name', 'age']]
 .assign(name_y=df2['name'],
         name=df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])))
 .explode('name')
 .drop_duplicates(),
 on=['name', 'age']
)

Output:

           name  age department            name_y
0    DJ Griffin   27         FD  D.J. Griffin III
1  Harris Smith   33         RD      Harris Smith
2  Harris Smith   45         BA      Harris Smith
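The chained expression can be easier to follow when split into steps. The sketch below is the same approach with intermediate variables; `candidates` and `lookup` are names chosen here for illustration.

```python
import difflib
import pandas as pd

df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"],
                    ["Harris Smith", 45, "BA"]],
                   columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"],
                    ["Harris Smith", 45, "BA"], ["Miles Jones", 58, "RD"]],
                   columns=["name", "age", "department"])

candidates = df1["name"].tolist()

# Step 1: keep the original df2 name as name_y and replace name
# with the list of close matches found in df1.
lookup = df2[["name", "age"]].assign(
    name_y=df2["name"],
    name=df2["name"].apply(lambda x: difflib.get_close_matches(x, candidates)),
)

# Step 2: one row per matched candidate; an empty list becomes a single NaN row.
lookup = lookup.explode("name")

# Step 3: "Harris Smith" occurs twice in df1, so each Harris Smith row
# was exploded into two identical rows; remove the repeats.
lookup = lookup.drop_duplicates()

# Step 4: the inner merge on name and age keeps only exact age matches
# and drops the unmatched Miles Jones row.
df3 = df1.merge(lookup, on=["name", "age"])
print(df3)
```

Note that only the name is matched approximately; age is compared exactly by the merge.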
