How to Find the Difference Between Two Data Frames
When working with data, we often have two data frames that overlap but each contain rows the other lacks. To obtain only the rows that appear in one data frame and not the other, we need a data frame difference operation.
In the simplest case, we can concatenate the two data frames and call drop_duplicates with keep=False. This drops every row that occurs more than once in the concatenated result, leaving only the rows that appear in exactly one of the two data frames:
pd.concat([df1, df2]).drop_duplicates(keep=False)
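As a minimal sketch of the one-liner above, assuming two small frames with hypothetical columns A and B (the sample values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data; the column names A and B are illustrative only.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
df2 = pd.DataFrame({"A": [2, 3, 4], "B": ["y", "z", "w"]})

# Stack both frames, then drop every row that occurs more than once.
sym_diff = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Rows (1, "x") from df1 and (4, "w") from df2 remain.
# The original index labels are kept, so renumber them for readability.
sym_diff = sym_diff.reset_index(drop=True)
print(sym_diff)
```

Note that the result mixes rows from both frames, so it is a symmetric difference rather than "rows of df1 that are not in df2".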
However, this method assumes that neither data frame contains duplicate rows of its own. If a row is repeated within df1 or within df2, keep=False removes every copy of it, even when that row is missing from the other data frame. To handle this scenario, we can use one of two alternative approaches:
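Before turning to those approaches, the pitfall can be seen in a small sketch. The data here is hypothetical: df1 repeats a row that df2 does not contain at all.

```python
import pandas as pd

# Hypothetical data: the row (1, "x") appears twice in df1 and never in df2.
df1 = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})
df2 = pd.DataFrame({"A": [2], "B": ["y"]})

result = pd.concat([df1, df2]).drop_duplicates(keep=False)

# (2, "y") is dropped because it occurs in both frames, as intended.
# (1, "x") is dropped too, because its two copies inside df1 count as duplicates.
print(len(result))  # 0
```

The row (1, "x") exists only in df1, yet it is missing from the result, which is why the duplicate-free assumption matters.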
Method 1: Using isin with Tuples
This method converts each row of both data frames into a tuple and then uses isin to check whether each row tuple from df1 also occurs in df2. Negating the result with ~ keeps the rows that exist only in df1, including any repeated copies:
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
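A short sketch of the tuple approach, again on hypothetical data in which df1 contains a repeated row that df2 lacks:

```python
import pandas as pd

# Hypothetical data: (1, "x") is repeated in df1 and absent from df2.
df1 = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})
df2 = pd.DataFrame({"A": [2, 3], "B": ["y", "z"]})

# One tuple per row, e.g. (1, "x"); both frames must list columns in the same order.
rows1 = df1.apply(tuple, axis=1)
rows2 = df2.apply(tuple, axis=1)

# Keep the rows of df1 whose tuple never occurs in df2.
only_in_df1 = df1[~rows1.isin(rows2)]
print(only_in_df1)
```

Both copies of (1, "x") survive here, whereas the concat and drop_duplicates approach would have discarded them.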
Method 2: Merge with Indicator
Merging the two data frames with indicator=True adds a _merge column that records where each row was found. With how='left', every row of df1 is kept and is marked either both (also present in df2) or left_only (present only in df1), so excluding the rows marked both leaves the rows that are unique to df1. The result still contains the _merge column, which can be dropped afterwards:
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
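The merge approach can be sketched the same way. The data is hypothetical, and the final drop of the _merge helper column is an optional extra step that the one-liner above does not perform:

```python
import pandas as pd

# Hypothetical data: (1, "x") is repeated in df1 and absent from df2.
df1 = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})
df2 = pd.DataFrame({"A": [2, 3], "B": ["y", "z"]})

# With no "on" argument, merge joins on all columns the two frames share.
merged = df1.merge(df2, how="left", indicator=True)

# Keep the rows found only in df1, then remove the helper column.
only_in_df1 = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

Selecting `left_only` explicitly is equivalent to excluding `both` for a left join, and it states the intent a little more directly.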
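To summarize, the concat and drop_duplicates approach returns the rows that appear in exactly one of the two data frames, provided neither contains duplicates of its own. Methods 1 and 2 return the rows of df1 that are absent from df2 and preserve repeated rows; swap df1 and df2 to get the rows unique to df2. On large data frames, the row-wise apply in Method 1 is typically slower than the merge in Method 2.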