Comparing Two DataFrames

Often, when you have two DataFrames of similar data, you may want to see where the differences between them lie. pandas provides this functionality in a DataFrame method called compare.

Let's say we have these two (very similar) DataFrames:

import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'}
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])

Our example DataFrames

Using Compare

If we want to find the rows that differ, we can call compare on one DataFrame and pass it the other:

import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'}
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])
df1.compare(df2)

Output:

     Name        Favorite Color         Show
     self  other           self   other self other
3  Howard  Jesse          Green  Maroon  BCS    BB
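One caveat worth knowing: compare expects both DataFrames to carry the same row and column labels. The sketch below uses two small hypothetical frames (left and right, not part of the example above) to show what happens when one of them has an extra row. pandas raises a ValueError rather than returning a diff.

```python
import pandas as pd

# Hypothetical frames for illustration only.
left = pd.DataFrame({"Name": ["Saul", "Kim"], "Show": ["BCS", "BCS"]})
# `right` has one extra row, so its index does not match `left`.
right = pd.DataFrame({"Name": ["Saul", "Kim", "Jesse"], "Show": ["BCS", "BCS", "BB"]})

try:
    left.compare(right)
    error_message = None
except ValueError as exc:
    # compare refuses to diff frames whose labels are not identical.
    error_message = str(exc)

print(error_message)
```

If your DataFrames can differ in shape, you would need to align them first (for example by reindexing) before calling compare.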

By default (align_axis=1), the differing values are laid out horizontally, with self and other as side-by-side sub-columns. If you would rather stack them vertically, with self and other as alternating rows, set the align_axis parameter to 0:

import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'}
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])
df1.compare(df2, align_axis=0)

Output:

           Name Favorite Color Show
3 self   Howard          Green  BCS
  other   Jesse         Maroon   BB
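With align_axis=0 the result has a two-level row index: the original row label, then self or other. As a minimal sketch, assuming a cut-down two-row version of the example data, the xs method can pull out just one side of the comparison. The variable names stacked, self_rows, and other_rows are illustrative.

```python
import pandas as pd

# Cut-down version of the example data, for illustration.
df1 = pd.DataFrame({"Name": ["Kim", "Howard"], "Show": ["BCS", "BCS"]})
df2 = pd.DataFrame({"Name": ["Kim", "Jesse"], "Show": ["BCS", "BB"]})

stacked = df1.compare(df2, align_axis=0)

# Level 1 of the row index holds the "self" / "other" labels.
self_rows = stacked.xs("self", level=1)    # values from df1
other_rows = stacked.xs("other", level=1)  # values from df2

print(self_rows)
print(other_rows)
```

Each extracted frame is indexed by the original row label, so the two can be looked up or joined by row.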

Other Parameters

By default, compare shows only the values that differ between the two DataFrames, but we can see more by passing some additional parameters in the function call.

If you set the keep_shape parameter to True, the result keeps every row and column of the original DataFrame, with NaN where the values match and the actual values where they differ:

import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'}
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])
df1.compare(df2, keep_shape=True)

Output:

     Name        Favorite Color         Show
     self  other           self   other self other
0     NaN    NaN            NaN     NaN  NaN   NaN
1     NaN    NaN            NaN     NaN  NaN   NaN
2     NaN    NaN            NaN     NaN  NaN   NaN
3  Howard  Jesse          Green  Maroon  BCS    BB
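Because keep_shape=True preserves the original row labels, the NaN pattern can double as a mask of which rows changed. The following is a rough sketch, assuming the source data itself has no missing values (otherwise a NaN could be a real value rather than a match). The names shaped, changed_rows, and changed_labels are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'}
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])

shaped = df1.compare(df2, keep_shape=True)

# A row changed if any of its cells is non-NaN in the comparison result.
changed_rows = shaped.notna().any(axis=1)
changed_labels = changed_rows[changed_rows].index.tolist()

print(changed_rows)
print(changed_labels)
```

The boolean Series lines up with the original index, so it can be used to filter df1 or df2 directly, for example df1[changed_rows].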

We can also set the keep_equal parameter to True so that matching values are shown instead of NaN, which may be useful for visualization or when you want to combine DataFrames using sub-columns:

import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'}
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])
df1.compare(df2, keep_shape=True, keep_equal=True)

Output:

     Name         Favorite Color         Show
     self   other           self   other self other
0    Saul    Saul         Maroon  Maroon  BCS   BCS
1  Walter  Walter           Blue    Blue   BB    BB
2     Kim     Kim            Red     Red  BCS   BCS
3  Howard   Jesse          Green  Maroon  BCS    BB
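To illustrate what working with the sub-columns might look like, here is a small sketch that splits the combined result back into its self and other halves using xs on the inner column level. The names full, self_view, and other_view are illustrative. With both keep_shape and keep_equal set, each half should carry the same values as the corresponding input DataFrame.

```python
import pandas as pd

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'}
])
df2 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
])

full = df1.compare(df2, keep_shape=True, keep_equal=True)

# The columns are a two-level index: (original column, "self" or "other").
self_view = full.xs("self", axis=1, level=1)
other_view = full.xs("other", axis=1, level=1)

print(self_view)
print(other_view)
```

A single pair can also be selected by its top-level name, for example full['Name'], which returns just the self and other columns for Name.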

You can read more about compare in the pandas documentation for DataFrame.compare.