Python Pandas: merging on multiple columns does not match every row, and alternatives to a regular merge

Posted on September 16

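I have an old DataFrame (dfo) to which someone decided to add extra columns (notes), but this dataset has no key column. I also have a new DataFrame (dfn) that should represent the same data but has no notes columns. I have simply been asked to transfer the old notes to the new DataFrame. I have been able to find matches for some rows, but not for all of them. What I would like to know is whether there are additional tricks for merging on multiple columns, or whether there is a better-suited alternative.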

Below is sample data from the raw CSV rows that did not merge. Once I put it into dictionaries, it merged just fine.

import numpy as np
import pandas as pd

example_new = {'S1': ['W', 'CD', 'W', 'W', 'CD', 'W', 'CD'], 
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'], 
'N1': ['N', 'Y', 'N', 'Y', 'N', 'N', 'N'], 
'AC': ['C253', '100', '1064', '1920', '1996', '100', 'C253'], 
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'], 
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'], 
'U': [' ', 'N', 'U/C', 'T', 'C', 'N', 'P'], 
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'], 
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}

example_old = {'S1': ['W', 'W', 'W', 'W', 'CD', 'CD'], 
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'],
'N1': ['N', 'Y', 'N', 'N', 'N', 'Y'], 
'AC': ['1064', '1920', 'C253', '100', 'C253', '100'], 
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'], 
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'], 
'U': ['U/C', 'T', ' ', 'N', 'P', 'N'], 
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'], 
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan], 
'Note1': ['Y', 'Y', 'Y', 'Y', 'N', 'N']}

dfo = pd.DataFrame.from_dict(example_old)
dfn = pd.DataFrame.from_dict(example_new)
# parse the DateTime strings so both frames hold datetime64 values
dfn['DateTime'] = pd.to_datetime(dfn['DateTime'])
dfo['DateTime'] = pd.to_datetime(dfo['DateTime'])
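One thing that may be worth ruling out, and this is only an assumption because every timestamp in the sample has minute precision: if one file stores seconds and the other does not, an exact match on DateTime fails even for the same event. A minimal sketch with made-up timestamps, assuming month-first dates, that floors the more precise side to the minute before comparing:

```python
import pandas as pd

# Hypothetical timestamps: the "new" export carries seconds, the "old" one does not.
old_ts = pd.Series(["6/9/2021 13:26", "6/9/2021 13:27"])
new_ts = pd.Series(["6/9/2021 13:26:45", "6/9/2021 13:27:00"])

# Explicit formats (assumed month/day/year) avoid silent day-first guessing.
old_dt = pd.to_datetime(old_ts, format="%m/%d/%Y %H:%M")
new_dt = pd.to_datetime(new_ts, format="%m/%d/%Y %H:%M:%S")

# Exact comparison fails wherever the seconds are non-zero.
exact_equal = (old_dt == new_dt).tolist()

# Flooring to the minute puts both sides on the same resolution.
floored_equal = (old_dt == new_dt.dt.floor("min")).tolist()

print(exact_equal, floored_equal)
```

If the floored comparison recovers rows, the floored column can be used as the merge key instead of the raw DateTime.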

Code:

dfo = dfo # shape (10250, 10) the additional columns are notes. 
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']

dfn = dfn # shape (13790, 9) there are a lot of corrections to the prior data 
# and additional new data.
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']

# to make sure that the dtypes are the same.
# I read that making sure the object columns are all strings works better. Really Good tip!!
str_col = ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfo[str_col] = dfo[str_col].astype(str)
dfn[str_col] = dfn[str_col].astype(str)
dfo = dfo.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
dfn = dfn.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

# I read that encoding the columns might show characters that are hidden.
# I did not find this helpful for my data. 
# u = dfo.select_dtypes(include=[object])
# dfo[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
# u = dfn.select_dtypes(include=[object])
# dfn[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))

# test / check the dtypes
otypes = dfo.columns.to_series().groupby(dfo.dtypes).groups
ntypes = dfn.columns.to_series().groupby(dfn.dtypes).groups

# display results... dtypes

In [95]: print(otypes)
Out[74]: {datetime64[ns]: ['DateTime'], 
       object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']}

In [82]: print(ntypes)
Out[82]: {datetime64[ns]: ['DateTime'], 
       object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']}

# Time to merge 
subset = ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfm = pd.merge(dfn,dfo, how="left", on=subset)
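The cleanup in the code above strips leading and trailing whitespace, but characters such as zero-width spaces or non-breaking spaces inside a value survive it and still break an exact match. Here is a rough sketch of a stricter normalizer; `normalize_key` is a hypothetical helper name and the sample strings are invented for illustration:

```python
import pandas as pd

def normalize_key(s: pd.Series) -> pd.Series:
    """Return a copy of s with invisible characters and odd spacing removed."""
    return (
        s.astype(str)
         .str.normalize("NFKC")                           # e.g. non-breaking space -> plain space
         .str.replace(r"[\u200b\ufeff]", "", regex=True)  # zero-width space, BOM
         .str.replace(r"\s+", " ", regex=True)            # collapse runs of whitespace
         .str.strip()
    )

# Invented values: one zero-width space, one trailing space, one non-breaking space.
raw = pd.Series(["C_041", "C_041\u200b", "14-2-EP ", "U/C", "14\xa02"])
clean = normalize_key(raw)

# Rows whose value changed are the ones carrying something invisible.
suspicious = raw[raw != clean]
print(suspicious.map(repr))
```

Printing with `repr` makes the hidden characters visible, which is usually easier to read than encoding the whole column to bytes.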

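About 75% of the data is merging. I have done spot checks, and there is more data that should merge but does not. What else should I do to get the remaining 15% to 25% to merge? If you want to look at the data in the CSV files, I have provided a link: Github to csv files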
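To see exactly which rows fail to match, `pd.merge` accepts `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. A minimal sketch on invented miniature frames (the `_demo` names and values are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical stand-ins for dfn and dfo with only three key columns.
dfn_demo = pd.DataFrame({
    "S1": ["W", "CD", "W"],
    "AC": ["C253", "100", "1064"],
    "U": ["", "N", "U/C"],
})
dfo_demo = pd.DataFrame({
    "S1": ["W", "CD"],
    "AC": ["1064", "100 "],   # trailing space, so it will not match "100"
    "U": ["U/C", "N"],
    "Note1": ["Y", "N"],
})
keys = ["S1", "AC", "U"]

# An outer merge keeps unmatched rows from both sides; _merge says where each came from.
merged = dfn_demo.merge(dfo_demo, how="outer", on=keys, indicator=True)
counts = merged["_merge"].value_counts()

# Old rows whose notes were not transferred.
old_only = merged.loc[merged["_merge"] == "right_only", keys + ["Note1"]]
print(counts)
print(old_only)
```

Comparing the `right_only` rows against the `left_only` rows side by side tends to show quickly which column is responsible for the mismatch.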

# flag the rows that merge on the full set of columns
dfi = pd.merge(dfn, dfo, how="inner", on=subset)
dfi['inner'] = 1
df_o_nm = pd.merge(dfo, dfi, how="left", on=subset)
# not merged data from old data frame
df_o_nm = df_o_nm.loc[df_o_nm['inner'] != 1][subset]

# for each unmerged old row, look for new rows that agree on a smaller set of
# columns and print which of the remaining columns differ
for index, row in df_o_nm.iterrows():
    df_sel = dfn.loc[
        (dfn['S1'] == row['S1']) &
        (dfn['N1'] == row['N1']) &
        (dfn['AC'] == row['AC']) &
        (dfn['U'] == row['U'])]
    if len(df_sel) == 0:
        print('no matching data subset found.')
    else:
        print(f'{len(df_sel)} rows matching subset of columns found')
        for idx, row_sel in df_sel.iterrows():
            for col in ['DateTime', 'LN', 'PS', 'TI']:
                if row[col] != row_sel[col]:
                    print(f"{idx}: {col}: {row[col]} --> {row_sel[col]}")
    print('---')
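On frames with thousands of rows, the row-by-row loop above can be slow. The same idea can be sketched as a single merge on the smaller key set, followed by a column-wise comparison of the suffixed columns. The frames and the names `old_unmatched`, `loose_keys`, and `check_cols` below are invented for illustration:

```python
import pandas as pd

# Hypothetical stand-ins: the new frame and the old rows that did not merge.
dfn_demo = pd.DataFrame({
    "S1": ["W", "CD"],
    "AC": ["1064", "100"],
    "LN": ["Eddie", "Eddie"],
    "TI": ["14-2-EP", "14-2-EP"],
})
old_unmatched = pd.DataFrame({
    "S1": ["CD"],
    "AC": ["100"],
    "LN": ["Eddy"],        # differs from "Eddie" in the new frame
    "TI": ["14-2-EP"],
})
loose_keys = ["S1", "AC"]   # columns trusted to identify a row
check_cols = ["LN", "TI"]   # columns suspected of blocking the full merge

# Pair every unmatched old row with its candidate rows in the new frame.
cand = old_unmatched.merge(dfn_demo, on=loose_keys, suffixes=("_old", "_new"))

# Collect (S1, AC, column, old value, new value) for every disagreement.
diffs = []
for col in check_cols:
    changed = cand[cand[f"{col}_old"] != cand[f"{col}_new"]]
    for _, r in changed.iterrows():
        diffs.append((r["S1"], r["AC"], col, r[f"{col}_old"], r[f"{col}_new"]))

print(diffs)
```

If the loose keys are not unique in the new frame, one old row pairs with several candidates, so the result is better treated as a list to review than as an automatic fix.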
