Comparing Two DataFrames


Often, when you have two DataFrames of similar data, you want to see where they differ. pandas provides this through the DataFrame.compare method.

Let's say we have these two (very similar) DataFrames:

import pandas as pd

# Two DataFrames with identical row and column labels; only the last row differs
df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'},
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])
Our example DataFrames

Using Compare

To find the rows that differ, we can call df1.compare(df2).
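A minimal sketch of that call, assuming the same data as the example DataFrames (rebuilt here in a shorter form so the snippet runs on its own). The variable name diff is only illustrative. Note that compare expects both DataFrames to have identical row and column labels.

```python
import pandas as pd

# Same data as df1 and df2 above, rebuilt so the snippet is self-contained
columns = ['Name', 'Favorite Color', 'Show']
shared = [['Saul', 'Maroon', 'BCS'], ['Walter', 'Blue', 'BB'], ['Kim', 'Red', 'BCS']]
df1 = pd.DataFrame(shared + [['Howard', 'Green', 'BCS']], columns=columns)
df2 = pd.DataFrame(shared + [['Jesse', 'Maroon', 'BB']], columns=columns)

# Keep only the rows and columns that differ; 'self' is df1, 'other' is df2
diff = df1.compare(df2)
print(diff)
```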

Output:
     Name        Favorite Color         Show
     self  other           self   other self other
3  Howard  Jesse          Green  Maroon  BCS    BB

By default (align_axis=1), the self and other values are placed side by side as sub-columns. If you would rather stack them vertically, as alternating self and other rows, set the align_axis parameter to 0: df1.compare(df2, align_axis=0).
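As a rough sketch of the stacked layout, again assuming the example data (rebuilt so the snippet stands alone; diff_rows is an illustrative name):

```python
import pandas as pd

# Same data as df1 and df2 above, rebuilt so the snippet is self-contained
columns = ['Name', 'Favorite Color', 'Show']
shared = [['Saul', 'Maroon', 'BCS'], ['Walter', 'Blue', 'BB'], ['Kim', 'Red', 'BCS']]
df1 = pd.DataFrame(shared + [['Howard', 'Green', 'BCS']], columns=columns)
df2 = pd.DataFrame(shared + [['Jesse', 'Maroon', 'BB']], columns=columns)

# align_axis=0 moves 'self'/'other' into the row index instead of the columns
diff_rows = df1.compare(df2, align_axis=0)
print(diff_rows)
```

The result has a two-level row index (original row label, then self or other), so a single value can be looked up with something like diff_rows.loc[(3, 'other'), 'Name'].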

Output:
           Name Favorite Color Show
3 self   Howard          Green  BCS
  other   Jesse         Maroon   BB

Other Parameters

By default, compare shows only the differences between the two DataFrames, but we can see more by passing some additional parameters in the function call.

If you set the keep_shape parameter to True, as in df1.compare(df2, keep_shape=True), the result keeps every row and column of the original DataFrames, with NaN where the values match and the actual values where they differ.
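A short sketch of keep_shape in use, assuming the example data once more (rebuilt so it runs on its own; full is an illustrative name):

```python
import pandas as pd

# Same data as df1 and df2 above, rebuilt so the snippet is self-contained
columns = ['Name', 'Favorite Color', 'Show']
shared = [['Saul', 'Maroon', 'BCS'], ['Walter', 'Blue', 'BB'], ['Kim', 'Red', 'BCS']]
df1 = pd.DataFrame(shared + [['Howard', 'Green', 'BCS']], columns=columns)
df2 = pd.DataFrame(shared + [['Jesse', 'Maroon', 'BB']], columns=columns)

# keep_shape=True retains all rows and columns; matching cells become NaN
full = df1.compare(df2, keep_shape=True)
print(full)
```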

Output:
     Name        Favorite Color         Show
     self  other           self   other self other
0     NaN    NaN            NaN     NaN  NaN   NaN
1     NaN    NaN            NaN     NaN  NaN   NaN
2     NaN    NaN            NaN     NaN  NaN   NaN
3  Howard  Jesse          Green  Maroon  BCS    BB

We can also set keep_equal to True alongside keep_shape, as in df1.compare(df2, keep_shape=True, keep_equal=True), to fill in the matching values instead of NaN. This may be useful for visualization or when you want to combine DataFrames using sub-columns.
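A sketch combining both flags, under the same assumptions as before (example data rebuilt inline; side_by_side is an illustrative name). With keep_equal=True on its own, rows that match entirely would still be dropped, which is why keep_shape=True is passed as well.

```python
import pandas as pd

# Same data as df1 and df2 above, rebuilt so the snippet is self-contained
columns = ['Name', 'Favorite Color', 'Show']
shared = [['Saul', 'Maroon', 'BCS'], ['Walter', 'Blue', 'BB'], ['Kim', 'Red', 'BCS']]
df1 = pd.DataFrame(shared + [['Howard', 'Green', 'BCS']], columns=columns)
df2 = pd.DataFrame(shared + [['Jesse', 'Maroon', 'BB']], columns=columns)

# keep_shape keeps every row and column; keep_equal shows matching values instead of NaN
side_by_side = df1.compare(df2, keep_shape=True, keep_equal=True)
print(side_by_side)
```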

Output:
     Name         Favorite Color         Show
     self   other           self   other self other
0    Saul    Saul         Maroon  Maroon  BCS   BCS
1  Walter  Walter           Blue    Blue   BB    BB
2     Kim     Kim            Red     Red  BCS   BCS
3  Howard   Jesse          Green  Maroon  BCS    BB

You can read more about compare in the pandas documentation.