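My goal is to join two DataFrames from different sources in Python using pandas, and then fill the null values in each column with the corresponding values from that same column.

The DataFrames have similar columns, but because the data sources differ, some text/object columns may hold different values. For example, the "Name" column may contain "Nick M." in one frame and "Nick Mason" in the other. However, some columns, such as "Date" (formatted YYYY-MM-DD), "Order ID" (numeric), and "Employee ID" (numeric), have consistent values in both frames (these are the columns we join on). It is worth mentioning that some columns may not even exist in one frame or the other, but they should be filled in as well.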

import pandas as pd

# Create DataFrame df1
df1_data = {
    'Date (df1)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-19', '2024-03-19'],
    'Order Id (df1)': [1, 2, 3, 4, 5, 1, 2],
    'Employee Id (df1)': [825, 825, 825, 825, 825, 825, 825],
    'Name (df1)': ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.'],
    'Region (df1)': ['SD', 'SD', 'SD', 'SD', 'SD', 'SD', 'SD'],
    'Value (df1)': [25, 37, 18, 24, 56, 77, 25],
}
df1 = pd.DataFrame(df1_data)

# Create DataFrame df2
df2_data = {
    'Date (df2)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-19', '2024-03-19', '2024-03-19', '2024-03-19'],
    'Order Id (df2)': [1, 2, 3, 1, 2, 3, 4],
    'Employee Id (df2)': [825, 825, 825, 825, 825, 825, 825],
    'Name (df2)': ['Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason'],
    'Region (df2)': ['San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego'],
    'Value (df2)': [25, 37, 19, 22, 17, 9, 76],
}
df2 = pd.DataFrame(df2_data)

# Combine DataFrames with an outer join on Date, Employee Id, and Order Id
outer_joined_df = pd.merge(
    df1,
    df2,
    how='outer',
    left_on=['Date (df1)', 'Employee Id (df1)', 'Order Id (df1)'],
    right_on=['Date (df2)', 'Employee Id (df2)', 'Order Id (df2)'],
)

# Display the result
outer_joined_df

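Below is the output of the joined frame. The values highlighted in yellow should be filled in.

[Image: the joined DataFrame, with the null values highlighted in yellow]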

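I tried the code below. It works as expected for the Date, Order Id, and Employee Id columns (because they are identical in both frames and we join on them), but not for the other columns, because those may hold different values. Basically, the logic is: if a value is null, fill it with the value from the same row of the specified column. However, because the values can differ, the filled column becomes messy: it ends up containing several variants of the same value.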

outer_joined_df['Date (df1)'] = outer_joined_df['Date (df1)'].combine_first(outer_joined_df['Date (df2)'])
outer_joined_df['Date (df2)'] = outer_joined_df['Date (df2)'].combine_first(outer_joined_df['Date (df1)'])

outer_joined_df['Order Id (df1)'] = outer_joined_df['Order Id (df1)'].combine_first(outer_joined_df['Order Id (df2)'])
outer_joined_df['Order Id (df2)'] = outer_joined_df['Order Id (df2)'].combine_first(outer_joined_df['Order Id (df1)'])

outer_joined_df['Employee Id (df1)'] = outer_joined_df['Employee Id (df1)'].combine_first(outer_joined_df['Employee Id (df2)'])
outer_joined_df['Employee Id (df2)'] = outer_joined_df['Employee Id (df2)'].combine_first(outer_joined_df['Employee Id (df1)'])

outer_joined_df['Name (df1)'] = outer_joined_df['Name (df1)'].combine_first(outer_joined_df['Name (df2)'])
outer_joined_df['Name (df2)'] = outer_joined_df['Name (df2)'].combine_first(outer_joined_df['Name (df1)'])

outer_joined_df['Region (df1)'] = outer_joined_df['Region (df1)'].combine_first(outer_joined_df['Region (df2)'])
outer_joined_df['Region (df2)'] = outer_joined_df['Region (df2)'].combine_first(outer_joined_df['Region (df1)'])

Here is the output:

[Image: the joined DataFrame after applying combine_first]

As you can see, it fills in the data, but not in the way I want.
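
To make the problem concrete, here is a minimal sketch on toy data (the three-row frame `merged` is hypothetical and only mirrors the shape of the Name columns after the outer merge). It shows how `combine_first` copies the other source's spelling into the gaps, so a single column ends up with both variants:

```python
import pandas as pd

# Toy data (hypothetical), shaped like the Name columns after the outer merge.
merged = pd.DataFrame({
    "Name (df1)": ["Nick M.", "Nick M.", None],
    "Name (df2)": ["Nick Mason", None, "Nick Mason"],
})

# combine_first keeps existing values and copies the other column's
# value into positions that are null.
filled = merged["Name (df1)"].combine_first(merged["Name (df2)"])

print(filled.tolist())
# ['Nick M.', 'Nick M.', 'Nick Mason']
```

The last row now carries the df2 spelling inside the df1 column, which is the "multiple variants" effect described above.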

The output I need:

[Image: the desired output]

Recommended answer

# a list with all column names, minus `(dfx)`
columns = ["Date", "Order Id", "Employee Id", "Name", "Region", "Value"]

# create a dict with a relation between values in df1 and df2, both ways
value_relations = {}
for col in columns:
    relations = (
        outer_joined_df[[f"{col} (df1)", f"{col} (df2)"]]
        .drop_duplicates()
        .dropna()
        .to_dict("tight")
        .get("data")
    )
    value_relations[col] = {k: v for k, v in relations}
    value_relations[col].update({v: k for k, v in relations})

    # fill values of df1 with the related value of df2
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"].map(value_relations[col])
    )
    # fill values of df2 with the related value of df1
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"].map(value_relations[col])
    )

Here is the output:

   Date (df1)  Order Id (df1)  Employee Id (df1) Name (df1) Region (df1)  ...  Order Id (df2) Employee Id (df2)  Name (df2)  Region (df2) Value (df2)
0  2024-03-18             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        25.0
1  2024-03-18             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        37.0
2  2024-03-18             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego        19.0
3  2024-03-18             4.0              825.0    Nick M.           SD  ...             NaN             825.0  Nick Mason     San Diego         NaN
4  2024-03-18             5.0              825.0    Nick M.           SD  ...             NaN             825.0  Nick Mason     San Diego         NaN
5  2024-03-19             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        22.0
6  2024-03-19             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        17.0
7  2024-03-19             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego         9.0
8  2024-03-19             NaN              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        76.0
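
To see what the lookup built in the loop contains, here is a small sketch for a single column. The toy frame `pairs` is hypothetical, and the sketch uses `dict(zip(...))` instead of `to_dict("tight")`, which should yield the same pairs here. It assumes each value has exactly one counterpart; if a value is paired with several different counterparts (as can happen in the Value column), later pairs overwrite earlier ones in the dict.

```python
import pandas as pd

# Toy pairs (hypothetical) standing in for
# outer_joined_df[["Name (df1)", "Name (df2)"]].
pairs = pd.DataFrame({
    "Name (df1)": ["Nick M.", "Nick M.", None, "Nick M."],
    "Name (df2)": ["Nick Mason", "Nick Mason", "Nick Mason", None],
})

# Keep only the distinct rows where both sides are present.
known = pairs.drop_duplicates().dropna()

# Build the lookup in both directions: df1 -> df2 and df2 -> df1.
lookup = dict(zip(known["Name (df1)"], known["Name (df2)"]))
lookup.update(zip(known["Name (df2)"], known["Name (df1)"]))

print(lookup)
# {'Nick M.': 'Nick Mason', 'Nick Mason': 'Nick M.'}

# Translate the df2 spelling into the df1 spelling before filling the gap.
pairs["Name (df1)"] = pairs["Name (df1)"].fillna(pairs["Name (df2)"].map(lookup))

print(pairs["Name (df1)"].tolist())
# ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.']
```

Values with no known counterpart map to NaN, which is why rows such as Order Id 4 and 5 stay null in the output above.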

If you want to fill the remaining null values as well, add the following at the end of each loop iteration:

    # fill remaining null values of df1
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"]
    )
    # fill remaining null values of df2
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"]
    )

Here is the output:

   Date (df1)  Order Id (df1)  Employee Id (df1) Name (df1) Region (df1)  ...  Order Id (df2) Employee Id (df2)  Name (df2)  Region (df2) Value (df2)
0  2024-03-18             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        25.0
1  2024-03-18             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        37.0
2  2024-03-18             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego        19.0
3  2024-03-18             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        24.0
4  2024-03-18             5.0              825.0    Nick M.           SD  ...             5.0             825.0  Nick Mason     San Diego        56.0
5  2024-03-19             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        22.0
6  2024-03-19             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        17.0
7  2024-03-19             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego         9.0
8  2024-03-19             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        76.0
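
For reuse, both steps could be wrapped in one function. The following is only a sketch under assumptions: the helper name `fill_from_counterpart` and its `left`/`right` suffix parameters are hypothetical, the two-column toy frames are made up for illustration, and the lookup is built with `dict(zip(...))` rather than `to_dict("tight")`.

```python
import pandas as pd


def fill_from_counterpart(merged, columns, left="(df1)", right="(df2)"):
    """Hypothetical helper: fill nulls in each '<col> (df1)' / '<col> (df2)' pair."""
    out = merged.copy()
    for col in columns:
        a, b = f"{col} {left}", f"{col} {right}"

        # Distinct pairs where both sides are known, mapped in both directions.
        known = out[[a, b]].drop_duplicates().dropna()
        lookup = dict(zip(known[a], known[b]))
        lookup.update(zip(known[b], known[a]))

        # Step 1: fill with the translated counterpart where a pairing is known.
        out[a] = out[a].fillna(out[b].map(lookup))
        out[b] = out[b].fillna(out[a].map(lookup))

        # Step 2: fall back to the raw counterpart value for anything still null.
        out[a] = out[a].fillna(out[b])
        out[b] = out[b].fillna(out[a])
    return out


# Toy frames (hypothetical), reduced to one key column and one text column.
df1 = pd.DataFrame({
    "Order Id (df1)": [1, 2, 3],
    "Name (df1)": ["Nick M.", "Nick M.", "Nick M."],
})
df2 = pd.DataFrame({
    "Order Id (df2)": [1, 2, 4],
    "Name (df2)": ["Nick Mason", "Nick Mason", "Nick Mason"],
})

merged = pd.merge(
    df1, df2, how="outer", left_on="Order Id (df1)", right_on="Order Id (df2)"
)
result = fill_from_counterpart(merged, ["Order Id", "Name"])
print(result)
```

Working on a copy leaves the merged frame untouched, so the filled result can be compared against the original nulls.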
