
How do I find the duplicates in a list and create another list with them?

python
prompt-engineering
best-practices
dataframe
by Alex Kataev · Feb 12, 2025
⚡ TLDR

Use list comprehension and Counter from the collections module to quickly find duplicates in a Python list. Here's how you can generate a list of items that appear more than once:

from collections import Counter

# Original list potentially packed with duplicates
original_list = ['🍎', '🍏', '🍊', '🍏', '🍎']

# An apple a day keeps the doctor away, but 2 apples a day gives us duplicates!
duplicates = [item for item, count in Counter(original_list).items() if count > 1]

print(duplicates)  # Output: ['🍎', '🍏']

Now the duplicates list stores the repeated items.
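If you also want to know how many times each duplicate shows up, Counter hands you that for free — a small extension of the snippet above:

from collections import Counter

original_list = ['🍎', '🍏', '🍊', '🍏', '🍎']
counts = Counter(original_list)

# Map each repeated item to its count
duplicate_counts = {item: count for item, count in counts.items() if count > 1}
print(duplicate_counts)  # {'🍎': 2, '🍏': 2}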

Getting the job done efficiently

For large datasets, efficiency is a big deal. collections.Counter cuts to the chase by operating in O(n) time, unlike the tortoise-like .count() method, which rescans the list for every element. When dealing with non-hashable items like lists or dictionaries, Counter is off the table, so we fall back to a step-by-step (and, alas, quadratic) custom solution:

# A list of dictionaries that may include duplicates
original_list = [{'apple': 1}, {'banana': 2}, {'apple': 1}]

seen = []
duplicates = []

# Each dictionary we've seen before is a duplicate
for dict_item in original_list:
    if dict_item in seen:
        duplicates.append(dict_item)
    else:
        seen.append(dict_item)
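And for the record, the tortoise-like .count() route mentioned above looks roughly like this — every element triggers a full rescan of the list, which is exactly what earns it the O(n²) label (shown for contrast, not as a recommendation):

# Quadratic sketch: fine for tiny lists, painful for big ones
original_list = ['🍎', '🍏', '🍊', '🍏', '🍎']
duplicates = list({item for item in original_list if original_list.count(item) > 1})
print(duplicates)  # e.g. ['🍎', '🍏'] -- set order isn't guaranteed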

Keeping it scalable

It's important to factor in the nature of your data. Ordered data? Generator-based methods could be your best bet:

# duplicates_everseen / unique_everseen ship with recent versions of more_itertools
from more_itertools import duplicates_everseen, unique_everseen

# Identifying duplicates while retaining order
duplicates = list(unique_everseen(duplicates_everseen(original_list)))

An efficient and readable way to handle large lists is to use a set, avoiding list.count() inside a loop, which would drag you into O(n²) complexity. Here's how you do it:

# Back to hashable items (the fruit list from the TLDR)
original_list = ['🍎', '🍏', '🍊', '🍏', '🍎']

unique_items = set()
duplicates = []

# A set has no duplicates, so it can't take a joke
for item in original_list:
    if item in unique_items:
        duplicates.append(item)
    else:
        unique_items.add(item)

This method passes through the list just once, checking each item against the set of items already seen.
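One caveat: if an item appears three or more times, the loop above appends it once per repeat. A quick extra pass collapses those repeats — a minimal sketch, assuming you only want each duplicate reported once:

original_list = ['🍎', '🍏', '🍎', '🍎']  # hypothetical triple-apple list

unique_items = set()
duplicates = []
for item in original_list:
    if item in unique_items:
        duplicates.append(item)  # '🍎' lands here twice
    else:
        unique_items.add(item)

print(duplicates)             # ['🍎', '🍎']
print(list(set(duplicates)))  # ['🍎'] -- each duplicate reported once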

Additional Pythonic solutions

The Counter method is handy, sure, but why stop there?

Scoring with the champion chain

Score some quick wins by chaining a couple of generator utilities from the third-party iteration_utilities package:

from iteration_utilities import duplicates, unique_everseen

# Chaining "duplicates" and "unique_everseen" for that Formula 1 pitstop-level efficiency
unique_duplicates = list(unique_everseen(duplicates(original_list)))

Guard against non-hashables

Dealing with non-hashable elements like dictionaries? Nothing a good loop strategy can’t fix:

seen = []
duplicates = []

# Peek-a-boo with each non-hashable dictionary
for dict_item in original_list:
    if dict_item not in seen:
        seen.append(dict_item)
    else:
        duplicates.append(dict_item)
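If the dictionaries hold only hashable values, you can claw back O(n) time by reducing each one to a hashable fingerprint first — a sketch, assuming tuple(sorted(d.items())) works as a key for your data:

original_list = [{'apple': 1}, {'banana': 2}, {'apple': 1}]

seen_keys = set()
duplicates = []

for dict_item in original_list:
    key = tuple(sorted(dict_item.items()))  # hashable stand-in for the dict
    if key in seen_keys:
        duplicates.append(dict_item)
    else:
        seen_keys.add(key)

print(duplicates)  # [{'apple': 1}]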

Get empowered with Pandas

Because who doesn't love Pandas? This master tool is the go-to option for dealing with varied data types and intricate data structures. It offers robust duplicate handling with tons of added features.

import pandas as pd

# Back to the fruit list from the TLDR
original_list = ['🍎', '🍏', '🍊', '🍏', '🍎']

# Original list on a pandas-powered diet
series = pd.Series(original_list)

# Et voilà! Spot the duplicates like a pro
duplicates_series = series[series.duplicated()]
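Two handy variations on the same call, both standard pandas behaviour — duplicated(keep=False) flags every occurrence of a repeated value, and .unique() lists each duplicated value just once:

# Flag every occurrence of a duplicated value, not just the later repeats
all_occurrences = series[series.duplicated(keep=False)]

# Each duplicated value, reported once
unique_duplicates = series[series.duplicated()].unique().tolist()
print(unique_duplicates)  # ['🍏', '🍎'] -- order follows each value's second appearance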