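I have an old DataFrame (dfo) to which someone decided to add extra columns (notes), but this dataset has no key column. I also have a new DataFrame (dfn) that is supposed to represent the same data but does not have the notes columns. I have simply been asked to transfer the old notes to the new DataFrame. I have been able to find matches for some rows, but not all of them. What I want to know is whether there are additional tricks for merging on multiple columns, or whether there is a better-suited alternative.
Below is sample data taken from the raw CSV rows that did not merge. Once I put it into dictionaries, it merged just fine.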
import numpy as np
import pandas as pd

example_new = {'S1': ['W', 'CD', 'W', 'W', 'CD', 'W', 'CD'],
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'],
'N1': ['N', 'Y', 'N', 'Y', 'N', 'N', 'N'],
'AC': ['C253', '100', '1064', '1920', '1996', '100', 'C253'],
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'],
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'],
'U': [' ', 'N', 'U/C', 'T', 'C', 'N', 'P'],
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'],
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
example_old = {'S1': ['W', 'W', 'W', 'W', 'CD', 'CD'],
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'],
'N1': ['N', 'Y', 'N', 'N', 'N', 'Y'],
'AC': ['1064', '1920', 'C253', '100', 'C253', '100'],
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'],
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'],
'U': ['U/C', 'T', ' ', 'N', 'P', 'N'],
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'],
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Note1': ['Y', 'Y', 'Y', 'Y', 'N', 'N']}
dfo = pd.DataFrame.from_dict(example_old)
dfn = pd.DataFrame.from_dict(example_new)
dfn['DateTime'] = pd.to_datetime(dfn['DateTime'])
dfo['DateTime'] = pd.to_datetime(dfo['DateTime'])
Code:
# dfo: shape (10250, 10); the additional column holds the notes.
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']
# dfn: shape (13790, 9); there are a lot of corrections to the prior data
# and additional new data.
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
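# Make sure that the dtypes are the same in both frames.
# I read that making sure the object columns are all strings works better. Really good tip!
# Note: astype(str) turns NaN into the literal string 'nan' (R2 is all NaN in the sample).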
str_col = ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfo[str_col] = dfo[str_col].astype(str)
dfn[str_col] = dfn[str_col].astype(str)
dfo = dfo.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
dfn = dfn.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
# I read that encoding the columns might show characters that are hidden.
# I did not find this helpful for my data.
# u = dfo.select_dtypes(include=[object])
# dfo[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
# u = dfn.select_dtypes(include=[object])
# dfn[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
# test / check the dtypes
otypes = dfo.columns.to_series().groupby(dfo.dtypes).groups
ntypes = dfn.columns.to_series().groupby(dfn.dtypes).groups
# display results... dtypes
In [95]: print(otypes)
Out[74]: {datetime64[ns]: ['DateTime'],
object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']}
In [82]: print(ntypes)
Out[82]: {datetime64[ns]: ['DateTime'],
object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']}
# Time to merge
subset = ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfm = pd.merge(dfn, dfo, how="left", on=subset)
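A minimal sketch of how the rows that fail to merge could be isolated, assuming exact matching on every key column as in the merge above. The toy frames and the helper names unmatched_rows and columns_blocking_match are hypothetical and only illustrate the idea; indicator=True is a standard pd.merge option that adds a _merge column.

```python
import pandas as pd


def unmatched_rows(new, old, keys):
    """Return the key columns of rows in `new` with no exact match in `old`."""
    # De-duplicate the right side so the left merge cannot multiply rows.
    merged = new.merge(old[keys].drop_duplicates(), how="left",
                       on=keys, indicator=True)
    return merged.loc[merged["_merge"] == "left_only", keys]


def columns_blocking_match(new, old, keys):
    """For each key column, count unmatched rows that would match if that
    single column were left out of the join."""
    missing = unmatched_rows(new, old, keys)
    counts = {}
    for col in keys:
        rest = [k for k in keys if k != col]
        relaxed = missing.merge(old[rest].drop_duplicates(),
                                how="inner", on=rest)
        counts[col] = len(relaxed)
    return counts


# Hypothetical toy data: the second row differs only in the case of 'U'.
new = pd.DataFrame({"S1": ["W", "CD", "W"],
                    "AC": ["100", "C253", "1064"],
                    "U": ["N", "P", "U/C"]})
old = pd.DataFrame({"S1": ["W", "CD", "W"],
                    "AC": ["100", "C253", "1064"],
                    "U": ["N", "p", "U/C"],
                    "Note1": ["Y", "N", "Y"]})
keys = ["S1", "AC", "U"]

missing = unmatched_rows(new, old, keys)
blockers = columns_blocking_match(new, old, keys)
print(missing)
print(blockers)  # {'S1': 0, 'AC': 0, 'U': 1}
```

A column with a high count is a likely place to look for formatting differences. On the real data the same calls would take dfn, dfo and subset.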
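About 75% of the data is merging. I did spot checks, and there is more data that should merge but does not. What else should I do to get the remaining 15% to 25% to merge? If you want to see the data in the CSV files, I have provided a link: Github to csv files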
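One possibility worth ruling out, sketched here under the assumption that some mismatches come from invisible or compatibility characters that str.strip() does not remove (for example a zero-width space or a byte-order mark). The helper names normalize_text and normalize_keys and the toy frames are hypothetical; whether this applies to the real CSV files is not known.

```python
import re
import unicodedata

import pandas as pd


def normalize_text(value):
    """Canonicalize one string cell; non-strings pass through unchanged."""
    if not isinstance(value, str):
        return value
    # NFKC folds compatibility forms, e.g. a no-break space becomes a space.
    text = unicodedata.normalize("NFKC", value)
    # Drop zero-width space and byte-order mark, which are not whitespace.
    text = re.sub("[\u200b\ufeff]", "", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def normalize_keys(df, cols):
    """Return a copy of `df` with the given columns normalized."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].map(normalize_text)
    return out


# Hypothetical toy data with hidden characters in the key column.
new = pd.DataFrame({"AC": ["C253", "\ufeff100"]})
old = pd.DataFrame({"AC": ["C253\u200b", "100"], "Note1": ["Y", "N"]})

raw = new.merge(old, how="left", on="AC")
fixed = normalize_keys(new, ["AC"]).merge(normalize_keys(old, ["AC"]),
                                          how="left", on="AC")
print(raw["Note1"].isna().all())   # True: nothing matched before cleaning
print(fixed["Note1"].tolist())     # ['Y', 'N']
```

If cleaning the keys this way changes nothing on the real data, hidden characters are probably not the cause and the per-column comparison is the better lead.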