我的目标是使用Pandas在Python中连接来自不同来源的两个数据帧,然后用同一列中的相应值填充列中的空值.
这些数据库框架有类似的列,但是由于数据源的差异,某些文本/对象列可能具有不同的值.例如,一个相框中的"Name"列可能包含"Nick M.另一个是尼克·梅森但是,某些列,如"Date"(格式为YYYY—MM—DD)、"Order ID"(数字)和"Employee ID"(数字)在两个框架中具有一致的值(我们基于它们连接框架).值得一提的是,有些列甚至可能不存在于一个或另一个框架中,但也应该填写.
import pandas as pd
# Create DataFrame df1
df1_data = {
'Date (df1)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19"],
'Order Id (df1)': [1, 2, 3, 4, 5, 1, 2],
'Employee Id (df1)': [825, 825, 825, 825, 825, 825, 825],
'Name (df1)': ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.'],
'Region (df1)': ['SD', 'SD', 'SD', 'SD', 'SD', 'SD', 'SD'],
'Value (df1)': [25, 37, 18, 24, 56, 77, 25]
}
df1 = pd.DataFrame(df1_data)
# Create DataFrame df2
df2_data = {
'Date (df2)': ['2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19", "2024-03-19", "2024-03-19"],
'Order Id (df2)': [1, 2, 3, 1, 2, 3, 4],
'Employee Id (df2)': [825, 825, 825, 825, 825, 825, 825],
'Name (df2)': ['Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason'],
'Region (df2)': ['San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego'],
'Value (df2)': [25, 37, 19, 22, 17, 9, 76]
}
df2 = pd.DataFrame(df2_data)
# Combine DataFrames
outer_joined_df = pd.merge(
df1,
df2,
how = 'outer',
left_on = ['Date (df1)', 'Employee Id (df1)', "Order Id (df1)"],
right_on = ['Date (df2)', 'Employee Id (df2)', "Order Id (df2)"]
)
# Display the result
outer_joined_df
下面是连接的帧的输出.应填写黄色的值.
我try 了下面的代码,它适用于Date、Order Id和Employee Id列,正如预期的那样(因为它们在两个邮箱中是相同的,我们基于它们加入),但不适用于其他,因为它们可能有不同的值.基本上,这段代码中的逻辑是,如果不执行,则填充指定列中同一行的值.然而,由于值可能不同,填充列变得混乱,因为它有相同值的多个变体.
outer_joined_df['Date (df1)'] = outer_joined_df['Date (df1)'].combine_first(outer_joined_df['Date (df2)'])
outer_joined_df['Date (df2)'] = outer_joined_df['Date (df2)'].combine_first(outer_joined_df['Date (df1)'])
outer_joined_df['Order Id (df1)'] = outer_joined_df['Order Id (df1)'].combine_first(outer_joined_df['Order Id (df2)'])
outer_joined_df['Order Id (df2)'] = outer_joined_df['Order Id (df2)'].combine_first(outer_joined_df['Order Id (df1)'])
outer_joined_df['Employee Id (df1)'] = outer_joined_df['Employee Id (df1)'].combine_first(outer_joined_df['Employee Id (df2)'])
outer_joined_df['Employee Id (df2)'] = outer_joined_df['Employee Id (df2)'].combine_first(outer_joined_df['Employee Id (df1)'])
outer_joined_df['Name (df1)'] = outer_joined_df['Name (df1)'].combine_first(outer_joined_df['Name (df2)'])
outer_joined_df['Name (df2)'] = outer_joined_df['Name (df2)'].combine_first(outer_joined_df['Name (df1)'])
outer_joined_df['Region (df1)'] = outer_joined_df['Region (df1)'].combine_first(outer_joined_df['Region (df2)'])
outer_joined_df['Region (df2)'] = outer_joined_df['Region (df2)'].combine_first(outer_joined_df['Region (df1)'])
下面是输出:
如您所见,它填充了数据,但不是我想要的方式.
我需要的输出: