Home >Backend Development >Python Tutorial >How Can I Efficiently Find the Differences Between Two Pandas DataFrames?

How Can I Efficiently Find the Differences Between Two Pandas DataFrames?

Barbara Streisand
Barbara StreisandOriginal
2024-11-25 20:21:12660browse

How Can I Efficiently Find the Differences Between Two Pandas DataFrames?

Finding Differences Between Two Data Frames

When working with pandas data frames, it's common to have situations where you need to compare two data frames and extract the differences between them. For instance, you may have a base data frame (df1) and a subset data frame (df2) and want to create a new data frame (df3) that contains all the rows and columns in df1 that are not present in df2.

Using drop_duplicates

The most direct method to achieve this is to use the drop_duplicates function. By concatenating df1 and df2, and then applying drop_duplicates to the concatenated data frame, you can eliminate duplicate rows and obtain a data frame containing only unique rows.

pd.concat([df1, df2]).drop_duplicates(keep=False)

Handling Duplicates

However, if either df1 or df2 itself contains duplicate rows, the drop_duplicates approach can result in erroneous output. To address this, you can use alternative methods that preserve original duplicates:

Method 1: Using isin with tuple

This method involves converting each row of the data frames into a tuple and checking if the tuples are present in both data frames. Rows that don't match any tuple in the other data frame will be included in the difference data frame.

df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]

Method 2: Merge with indicator

Another approach involves merging df1 and df2 using an indicator column. By specifying how='left' in the merge, rows from df1 that don't have corresponding rows in df2 will be assigned a _merge column value of left_only. You can then filter out rows based on this indicator column to obtain the difference data frame.

df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']

The above is the detailed content of How Can I Efficiently Find the Differences Between Two Pandas DataFrames?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn