How to Eliminate Duplicates by Columns, Retaining Rows with the Highest Values
When one column of a DataFrame contains duplicate values, a common way to deduplicate is to keep, for each duplicated value, only the row with the highest value in another column.
Consider this example DataFrame:
A   B
1   10
1   20
2   30
2   40
3   10
The goal is to transform this DataFrame into:
A   B
1   20
2   40
3   10
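One method sorts the DataFrame by column 'B' in descending order and then drops duplicates on column 'A'. Because drop_duplicates keeps the first occurrence by default, the row with the highest 'B' survives for each value of 'A':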
df.sort_values(by='B', ascending=False).drop_duplicates(subset='A')
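As a runnable illustration of the one-liner above, the following sketch rebuilds the example DataFrame and applies the sort-then-deduplicate approach. The final sort by 'A' and the index reset are additions made here only so the output matches the target table; they are not part of the original snippet.

```python
import pandas as pd

# Example data from the article.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

result = (
    df.sort_values(by="B", ascending=False)  # highest B first
      .drop_duplicates(subset="A")           # keep="first" is the default
      .sort_values(by="A")                   # optional: restore order by A
      .reset_index(drop=True)                # optional: tidy the index
)
print(result)
```

Note that the method chain returns a new DataFrame; df itself is left unchanged unless the result is assigned back.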
When the selection rule is more involved, for example when it depends on several columns or on a custom ordering, groupby gives finer control. The code below demonstrates this approach:
df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
This solution groups the rows by column 'A' and, for each group, returns the row whose value in column 'B' is the maximum. If several rows in a group tie for the maximum, idxmax selects the first of them.
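The same idea can also be written without apply, which may be worth considering on larger frames. The sketch below is an alternative formulation, not the snippet from the article: it asks each group for the index label of its maximum 'B' and then selects those rows with loc. It assumes the DataFrame has unique index labels; the variable names idx and result_idxmax are chosen here for illustration.

```python
import pandas as pd

# Example data from the article.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# For each value of A, find the index label of the row with the largest B.
idx = df.groupby("A")["B"].idxmax()

# Select those rows from the original frame (assumes unique index labels).
result_idxmax = df.loc[idx].reset_index(drop=True)
print(result_idxmax)
```

Unlike the apply version, this keeps 'A' as an ordinary column and preserves the original column dtypes, since whole rows are selected rather than rebuilt group by group.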