Remove duplicates by columns A, keeping the row with the highest value in column B

python

dataframe

groupby

drop_duplicates

byAlex Kataev·Mar 4, 2025

Let's get right to the point. Use pandas to remove duplicates in column A, retaining the row with the maximum value in column B:

import pandas as pd

# Assuming 'df' is your DataFrame, and 'A'/'B' are your columns
result = df.loc[df.groupby('A')['B'].idxmax()]

This is the "fast-food" version: quick, satisfying, and gets the job done. But sometimes you fancy a multiple-course dinner, right? Let's elaborate.

Deeper Dive: Efficient Practices and Alternatives

Sorting Prior to Dropping Duplicates

You can "pre-heat the oven" by sorting your DataFrame before dropping any duplicates.

sorted_df = df.sort_values(by='B', ascending=False) # Sorting in Descending Order
clean_df = sorted_df.drop_duplicates(subset='A', keep='first') # "First come, first serve" strategy

The sort_values method comes in handy if your data is otherwise unsorted or unordered.

Grouping without any Sorting

If you're "allergic" to sorting, you're in luck. Here's a groupby method without any sorting:

grouped_df = df.groupby('A', as_index=False).max()

This line is essentially saying, "Group by column A and take the max from column B. Easy-peasy, lemon squeezy!". But, it may not maintain other original row values - kind of like how my mom's recipes never quite taste like grandma's.

Championing the 'loc' Method

loc is your best friend when preserving all original values in the row with maximum B for each A.

indexes = df.groupby('A')['B'].idxmax()
result = df.loc[indexes]

Like using a GPS to find the best pizza place in town.

Removing the Need for 'apply'

Given apply can be a bit sluggish, vectorized operations usually run faster than a cheetah on caffeine:

df['rank'] = df.groupby('A')['B'].rank(method='first', ascending=False) # Create an Olympic Ranking
result = df[df['rank'] == 1].drop('rank', axis=1) # Goldberg strategy! Only the best survives!

Tailored Usage and Special Cases

Handling Equally Fast Racers

Consider where two racers in column B have the same best time:

result = df.sort_values('B', ascending=False).drop_duplicates(subset='A')

In the event of a tie in B, the first racer gets the glory. The question remains: who gets the champagne shower?

Deciding who to Keep in a Tie

The keep parameter in drop_duplicates acts as the referee during a tie:

result = df.drop_duplicates(subset='A', keep='last') # "I'mma let you finish, but..."

When Things get Complex: Lambda

If you're dealing with a special-case scenario, trust lambda to have your back:

result = df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax()])

This method is like a Swiss Army knife, very versatile!

explain-codes / Python / Remove duplicates by columns A, keeping the row with the highest value in column B

Linked

How to group dataframe rows into list in pandas groupby



Drop all duplicate rows across multiple columns in Python Pandas



Pandas get topmost n records within each group



Pandas dataframe get first row of each group



Show distinct column values in pyspark dataframe



How to sort a DataFrame in python pandas by two or more columns?



Pandas groupby, then sort within groups



Deeper Dive: Efficient Practices and Alternatives Tailored Usage and Special Cases