You can use compare.column_stats for this: it is a list of dictionaries, one per column, each holding information about that column.
import pandas as pd
import datacompy
data = {'id': [1, 2],
        'col1': [1, 2]}
df1 = pd.DataFrame(data)
data2 = {'id': [1, 2],
         'col1': ['A', 'B']}
df2 = pd.DataFrame(data2)
compare = datacompy.Compare(df1, df2, join_columns=['id'])
print(compare.report())
# ...
Columns with Unequal Values or Types
------------------------------------
Column df1 dtype df2 dtype # Unequal Max Diff # Null Diff
0 col1 int64 object 2 0.0 0
Sample Rows with Unequal Values
-------------------------------
id col1 (df1) col1 (df2)
0 2 2 B
1 1 1 A
compare.column_stats
[{'column': 'id',
'match_column': '',
'match_cnt': 2,
'unequal_cnt': 0,
'dtype1': 'int64',
'dtype2': 'int64',
'all_match': True,
'max_diff': 0.0,
'null_diff': 0},
{'column': 'col1',
'match_column': 'col1_match',
'match_cnt': 0,
'unequal_cnt': 2,
'dtype1': 'int64',
'dtype2': 'object',
'all_match': False,
'max_diff': 0.0,
'null_diff': 0}]
Use a list comprehension to get the names of all columns where unequal_cnt != 0:
unmatched_columns = [stat['column'] for stat in compare.column_stats
if stat['unequal_cnt'] != 0]
unmatched_columns
# ['col1']
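If you need this in more than one place, the comprehension can be wrapped in a small helper. The following is only a sketch: `columns_with_differences` is a hypothetical name, not part of datacompy, and the `stats` list is copied by hand from the compare.column_stats output above (trimmed to the keys used) so that it runs without datacompy installed. As an extra, it also lists columns whose dtypes differ.

```python
# Hand-copied from the compare.column_stats output above,
# trimmed to the keys this sketch actually reads.
stats = [
    {'column': 'id', 'unequal_cnt': 0, 'dtype1': 'int64', 'dtype2': 'int64'},
    {'column': 'col1', 'unequal_cnt': 2, 'dtype1': 'int64', 'dtype2': 'object'},
]


def columns_with_differences(column_stats):
    """Hypothetical helper (not a datacompy function).

    Returns two lists of column names: columns with at least one
    unequal value, and columns whose dtypes differ between the frames.
    """
    unequal_values = [s['column'] for s in column_stats if s['unequal_cnt'] != 0]
    unequal_dtypes = [s['column'] for s in column_stats if s['dtype1'] != s['dtype2']]
    return unequal_values, unequal_dtypes


unequal_values, unequal_dtypes = columns_with_differences(stats)
print(unequal_values)  # ['col1']
print(unequal_dtypes)  # ['col1']
```

With a real comparison you would pass compare.column_stats instead of the hand-written `stats` list.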
It can also be convenient to build a DataFrame from it with pd.DataFrame and filter as needed:
column_stats = pd.DataFrame(compare.column_stats)
column_stats
column match_column match_cnt unequal_cnt dtype1 dtype2 all_match \
0 id 2 0 int64 int64 True
1 col1 col1_match 0 2 int64 object False
max_diff null_diff
0 0.0 0
1 0.0 0
# e.g. column_stats[column_stats['unequal_cnt'].ne(0)]
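To make the filter in that last comment concrete, here is a minimal sketch. It assumes the list of dictionaries has exactly the shape shown earlier; the `stats` list is copied by hand from that output so the snippet runs without datacompy, and in practice you would pass compare.column_stats instead.

```python
import pandas as pd

# Hand-copied from the compare.column_stats output shown earlier.
stats = [
    {'column': 'id', 'match_column': '', 'match_cnt': 2, 'unequal_cnt': 0,
     'dtype1': 'int64', 'dtype2': 'int64', 'all_match': True,
     'max_diff': 0.0, 'null_diff': 0},
    {'column': 'col1', 'match_column': 'col1_match', 'match_cnt': 0, 'unequal_cnt': 2,
     'dtype1': 'int64', 'dtype2': 'object', 'all_match': False,
     'max_diff': 0.0, 'null_diff': 0},
]

column_stats = pd.DataFrame(stats)

# Keep only the rows (one row per compared column) with at least one unequal value.
unequal = column_stats[column_stats['unequal_cnt'].ne(0)]

print(unequal['column'].tolist())  # ['col1']
```

The DataFrame form pays off once you want to combine conditions, for example filtering on both unequal_cnt and a dtype mismatch in one boolean mask.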