
Drop all duplicate rows across multiple columns in Python Pandas

python · pandas · dataframe · drop_duplicates
by Alex Kataev · Dec 31, 2024
TLDR

Easily remove duplicates from your Pandas DataFrame by calling the drop_duplicates() method. Use the subset argument to focus on specific columns, or leave it undefined to compare all columns. The keep argument controls which occurrence survives: keep='first' (the default) retains the first occurrence, keep='last' the last, and keep=False drops every duplicate. Check out the demonstration below:

import pandas as pd

# Our DataFrame
df = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 2, 2], 'C': [3, 3, 3]})

# Zapping duplicate rows
df_unique = df.drop_duplicates()
print(df_unique)
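This prints the first and third rows, because the second row duplicates the first:

   A  B  C
0  1  2  3
2  2  2  3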

Using drop_duplicates(), every subsequent duplicate row is annihilated, leaving only the first occurrence of each unique combination in the DataFrame.

The nitty-gritty of drop_duplicates

Targeting specific columns when eliminating duplicates? The subset argument is your partner in trimming:

# Only remove duplicates based on columns 'A' and 'C'
df_unique = df.drop_duplicates(subset=['A', 'C'])  # Subset, your new best friend!

Desire to eradicate all duplicates, without any remains? Just say keep=False:

# Drop 'em all!
df_no_duplicates = df.drop_duplicates(subset=['A', 'C'], keep=False)  # With keep=False, none shall pass!
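On the sample DataFrame from the TLDR, rows 0 and 1 share the ('A', 'C') pair (1, 3), so keep=False removes both and only row 2 survives:

print(df_no_duplicates)
#    A  B  C
# 2  2  2  3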

Modifying the source DataFrame: the inplace=True flag enacts changes directly on the original DataFrame. Note that the call then returns None, so don't assign its result:

# Modify the original DataFrame (no turning back!)
df.drop_duplicates(subset=['A', 'C'], keep=False, inplace=True)  # With great power comes great responsibility!

Dealing with the sneaky duplicates

Sometimes, duplicates wear disguises. They may not appear identical due to subtle differences or inconsistencies. To unmask such duplicates, preprocess your data—trim spaces, standardize case, or employ text similarity techniques—prior to calling drop_duplicates().
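A minimal sketch of such preprocessing, assuming a hypothetical text column 'name' whose values differ only in whitespace or casing:

# 'name' is a hypothetical column used for illustration
df['name'] = df['name'].str.strip().str.lower()  # trim spaces, standardize case
df_clean = df.drop_duplicates(subset=['name'])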

drop_duplicates vs SQL

The drop_duplicates function is akin to the SQL SELECT DISTINCT *. Both tools are crafted to extract unique records, but the keep and subset parameters give drop_duplicates an edge in terms of flexibility.
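One practical difference worth noting: SELECT DISTINCT on a column list returns only those columns, while subset keeps every column of the surviving rows. A rough pandas translation of both behaviors:

# SQL: SELECT DISTINCT A, C FROM t  -- returns only columns A and C
distinct_ac = df[['A', 'C']].drop_duplicates()

# subset dedupes on the ('A', 'C') pair but keeps column 'B' as well
df_unique = df.drop_duplicates(subset=['A', 'C'])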

Other ways to skin the cat

There's more than one way to remove duplicates in Pandas. Exploring alternatives keeps coding interesting and your skills sharp!

Group by and filter

When deduplication requires complex logic, the dynamic duo of groupby and filter saves the day. For instance, to drop rows with duplicated values in column 'A', regardless of other columns:

# The (group)by day, (filter) by night hero we all need
df_filtered = df.groupby('A').filter(lambda x: len(x) == 1)
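The same result is available without groupby via a duplicated() mask, which is typically faster on large frames:

# keep=False marks every row whose 'A' value occurs more than once
df_filtered = df[~df.duplicated(subset='A', keep=False)]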

Combining sort_values with drop_duplicates

To ensure the most relevant data stays after deduplication, wield sort_values before calling drop_duplicates(). Sort by a ranking column first (here 'B'), so that keep='last' retains the highest-ranked row for each key:

# Sort before dropping - an arranged marriage of functions
df_sorted_uniques = df.sort_values(by='B', ascending=True).drop_duplicates(subset=['A'], keep='last')

Potential pitfalls and their antidotes

Heads up: deprecated parameters like take_last and cols have been removed from the API; they were replaced by keep and subset, respectively. Embrace the current API to dodge compatibility snares, and when in doubt, refer to the latest Pandas documentation for guidance.
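For reference, a sketch of the legacy call next to its modern equivalent:

# Legacy (removed): df.drop_duplicates(cols=['A'], take_last=True)
# Current API:
df.drop_duplicates(subset=['A'], keep='last')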