Comparing Two DataFrames
When you have two DataFrames holding similar data, you often want to see where they differ. pandas provides this through a DataFrame method called compare.
Let's say we have these two (very similar) DataFrames:
import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'},
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])
Our example DataFrames

Using Compare
If we want to find the rows that differ, we can call compare on one DataFrame and pass it the other:
# df1 and df2 are the example DataFrames defined above
df1.compare(df2)
Output:
     Name        Favorite Color         Show
     self  other           self   other self other
3  Howard  Jesse          Green  Maroon  BCS    BB
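The result of compare is itself a DataFrame whose columns form a two-level index (original column name, then self or other), so the differences can also be read programmatically. Here is a minimal sketch reusing the example data; the variable names diff, changed_rows, old_name and new_name are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'},
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])

diff = df1.compare(df2)

# Index labels of the rows that differ in at least one column
changed_rows = diff.index.tolist()

# Each sub-column is addressed with a (column, 'self' or 'other') tuple
old_name = diff.loc[3, ('Name', 'self')]
new_name = diff.loc[3, ('Name', 'other')]

print(changed_rows, old_name, new_name)
```

Here self refers to the DataFrame that compare was called on (df1) and other to the one passed in (df2).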
By default, compare places the self and other values side by side as sub-columns under each original column (align_axis=1). If you would rather have them stacked as separate rows, set the align_axis parameter to 0:
# df1 and df2 are the example DataFrames defined above
df1.compare(df2, align_axis=0)
Output:
           Name Favorite Color Show
3 self   Howard          Green  BCS
  other   Jesse         Maroon   BB
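With align_axis=0 the self/other labels move into the row index, which becomes a two-level index of (original row label, self or other). A small sketch of how one side could be selected from that layout, again on the example data; stacked and other_side are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'},
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])

stacked = df1.compare(df2, align_axis=0)

# A single cell: row 3, the 'other' side, column 'Name'
jesse = stacked.loc[(3, 'other'), 'Name']

# All differing rows as they appear in df2 (the 'other' side only)
other_side = stacked.xs('other', level=1)

print(jesse)
print(other_side)
```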
Other Parameters
By default, compare shows only the values that differ between the two DataFrames, but we can see more by passing some additional parameters in the function call.
If you set the keep_shape parameter to True, the result keeps every row and column of the original DataFrames, with NaN where the values match and the actual values where they differ:
# df1 and df2 are the example DataFrames defined above
df1.compare(df2, keep_shape=True)
Output:
     Name        Favorite Color         Show
     self  other           self   other self other
0     NaN    NaN            NaN     NaN  NaN   NaN
1     NaN    NaN            NaN     NaN  NaN   NaN
2     NaN    NaN            NaN     NaN  NaN   NaN
3  Howard  Jesse          Green  Maroon  BCS    BB
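Because keep_shape=True leaves NaN wherever the two DataFrames agree, the non-NaN cells can be counted to summarize how many values differ in each column. This is only a sketch and it assumes the original data contains no missing values of its own; shaped and diff_counts are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'},
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])

shaped = df1.compare(df2, keep_shape=True)

# Take the 'self' sub-columns only, then count the cells that are not NaN.
# Assumption: df1 and df2 have no missing values, so non-NaN means "differs".
diff_counts = shaped.xs('self', axis=1, level=1).notna().sum()

print(shaped.shape)
print(diff_counts)
```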
We can also set the keep_equal parameter to True so that matching values are shown instead of NaN, which may be useful for visualization or when you want to combine DataFrames using sub-columns:
# df1 and df2 are the example DataFrames defined above
df1.compare(df2, keep_shape=True, keep_equal=True)
Output:
     Name         Favorite Color         Show
     self   other           self   other self other
0    Saul    Saul         Maroon  Maroon  BCS   BCS
1  Walter  Walter           Blue    Blue   BB    BB
2     Kim     Kim            Red     Red  BCS   BCS
3  Howard   Jesse          Green  Maroon  BCS    BB
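If the two-level column header gets in the way when you work with this combined view, one option is to flatten it into ordinary column names. The sketch below joins each pair with an underscore; the naming scheme (for example Name_self) and the variables full and flat are assumptions made for illustration, not something compare does for you.

```python
import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'},
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])

full = df1.compare(df2, keep_shape=True, keep_equal=True)

# Flatten ('Name', 'self') into 'Name_self', and so on (hypothetical naming scheme)
flat = full.copy()
flat.columns = [f"{column}_{side}" for column, side in full.columns]

print(flat.columns.tolist())
print(flat.loc[3, ['Name_self', 'Name_other']].tolist())
```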
You can read more about compare in the pandas documentation.