How to Eliminate Duplicate Rows in a DataFrame, Keeping Only the Rows with the Highest Values in a Specific Column?

Linda Hamilton
2024-11-07 05:34:03

How to Eliminate Duplicates by Columns, Retaining Rows with the Highest Values

When one column of a DataFrame contains duplicate values, a common way to deduplicate is to keep, for each duplicated value, only the row with the highest value in another column.

Consider this example DataFrame:

A B
1 10
1 20
2 30
2 40
3 10

The goal is to keep, for each value of A, only the row with the largest B:

A B
1 20
2 40
3 10

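One method is to sort the DataFrame by B in descending order and then drop duplicates on A. Because drop_duplicates keeps the first occurrence by default, the row that survives for each value of A is the one with the highest B: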

df.sort_values(by='B', ascending=False).drop_duplicates(subset='A')
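
As a minimal runnable sketch of the sort-then-deduplicate approach, the snippet below rebuilds the example DataFrame and applies the one-liner. The variable name deduped and the final sort_values/reset_index calls are additions for readability, not part of the original expression; they only restore the order of A and renumber the rows.

```python
import pandas as pd

# Example data from the article
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

deduped = (
    df.sort_values(by="B", ascending=False)  # highest B first
      .drop_duplicates(subset="A")           # keep="first" is the default
      .sort_values(by="A")                   # optional: restore order by A
      .reset_index(drop=True)                # optional: renumber rows
)

print(deduped)
```

Without the last two calls the result holds the same rows, but in descending order of B and with their original index labels.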

For more complex cases, such as grouping on several columns or applying a selection rule that a single sort cannot express, groupby can be used instead:

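df.loc[df.groupby('A')['B'].idxmax()]

This groups the rows by column 'A', finds the index label of the row with the maximum value in column 'B' for each group, and selects those rows from the DataFrame. If several rows in a group tie for the maximum, idxmax returns the first of them. The selection assumes the DataFrame's index labels are unique.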
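
A minimal runnable sketch of the groupby route on the same example data, using idxmax on the grouped column and then selecting the matching rows with loc. The names best_rows and result are illustrative, and the sketch assumes the DataFrame has a unique index.

```python
import pandas as pd

# Example data from the article
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# For each value of A, the index label of the first row holding the maximum B
best_rows = df.groupby("A")["B"].idxmax()

# Select those rows; reset_index is optional and only renumbers the result
result = df.loc[best_rows].reset_index(drop=True)

print(result)
```

To group on more than one column, a list of column names can be passed to groupby in place of the single name.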
