
How do I remove duplicates from a list, while preserving order?

python
dataframe
pandas
best-practices
by Anton Shumikhin · Oct 19, 2024
TLDR

To eliminate duplicates from a list while keeping the order intact, the following one-liner is quite handy:

remove_dupes = lambda lst: list(dict.fromkeys(lst))
unique_list = remove_dupes([1, 2, 2, 3, 4, 4, 5])  # Behold [1, 2, 3, 4, 5], the purified list!

The dict.fromkeys() method builds a dictionary that preserves insertion order and, because keys are unique, silently drops duplicates; converting it back to a list yields the deduplicated result.
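To see why this works, inspect the intermediate dictionary that dict.fromkeys() builds (each element becomes a key mapped to None):

dict.fromkeys([1, 2, 2, 3, 4, 4, 5])        # {1: None, 2: None, 3: None, 4: None, 5: None}
list(dict.fromkeys([1, 2, 2, 3, 4, 4, 5]))  # [1, 2, 3, 4, 5]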

Diving into several methods

Using dict (Python 3.7+) and OrderedDict

Starting from Python 3.7, the dict object retains the insertion order. Hence, we can use it for eliminating duplicates:

unique_items = list(dict.fromkeys(your_list)) # "Dict" is my favorite four-letter word

In earlier versions of Python (3.6 and below, where plain dict order is not guaranteed), collections.OrderedDict can be employed to achieve the same result:

from collections import OrderedDict
unique_items = list(OrderedDict.fromkeys(your_list))  # A lifeline for "version-challenged" programmers

Both approaches are simple, Pythonic, and efficient as they don't require any external dependencies.

Using list comprehension with set

A set can be combined with a list comprehension to keep the order while getting O(1) average-time membership checks:

seen = set()
unique_list = [x for x in your_list if x not in seen and not seen.add(x)]  # Seen.add() or not seen.add(), that's the question

In the above comprehension, seen.add(x) returns None, so not seen.add(x) always evaluates to True; because and short-circuits, the set is updated only when x has not been seen before.
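If that short-circuit trick feels too clever for production code, the same behavior can be written as an explicit loop; here is a minimal sketch (the dedupe name is illustrative):

def dedupe(items):
    """Return a new list without duplicates, preserving first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)        # remember the item...
            result.append(item)   # ...and keep only its first occurrence
    return result

dedupe([1, 2, 2, 3, 4, 4, 5])  # [1, 2, 3, 4, 5]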

Applying lazy techniques for complex cases

For non-hashable or otherwise tricky items, unique_everseen from the third-party more_itertools package (pip install more-itertools) does a commendable job:

from more_itertools import unique_everseen
unique_list = list(unique_everseen(your_list))  # Because sometimes, L.A.Z.Y is the right approach

This code builds a lazy iterator that eliminates duplicates on-demand, useful for huge datasets.
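When the elements themselves are unhashable (nested lists, for instance), unique_everseen accepts a key callable; a short sketch using key=tuple to make each element hashable:

from more_itertools import unique_everseen

rows = [[1, 2], [1, 2], [3, 4]]
unique_rows = list(unique_everseen(rows, key=tuple))  # [[1, 2], [3, 4]]

Without a key, unique_everseen still handles unhashable items, but it falls back to slower linear searches through a list of seen values.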

Leveraging Pandas for large data scenarios

Pandas provides an efficient, vectorized approach suitable for large lists:

import pandas as pd
unique_list = pd.Series(your_list).drop_duplicates().tolist()  # Panda-monium for data wranglers

This can be especially handy for data wrangling tasks due to its versatility and performance.
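The same drop_duplicates call works directly on a DataFrame column, which is where this approach usually shines; a sketch with a hypothetical df:

import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Oslo", "Bergen", "Oslo"]})
unique_cities = df["city"].drop_duplicates().tolist()  # ['Oslo', 'Bergen']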

Optimizing performance and clarity

Picking the right method

Though one-liners might seem attractive, always prioritize readability and the context of usage:

  • For simple lists, use dict.fromkeys() or OrderedDict for better readability.
  • For large datasets, Pandas can offer fast operations with optimized functions.
  • For non-hashable items, unique_everseen from more_itertools ensures lazy checks.

Tips for best practices

  • Prefer built-in functions and the standard library to minimize external dependencies.
  • Value clarity over clever logic unless performance genuinely demands otherwise.
  • Benchmark your code to find the best-suited solution for your use case, as in the sketch below.
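
A minimal benchmarking sketch using the standard timeit module (the data and repeat counts are illustrative):

import timeit

data = list(range(1000)) * 10  # 10,000 elements with heavy duplication

def dict_way():
    return list(dict.fromkeys(data))

def set_way():
    seen = set()
    return [x for x in data if x not in seen and not seen.add(x)]

print("dict.fromkeys:      ", timeit.timeit(dict_way, number=1000))
print("set + comprehension:", timeit.timeit(set_way, number=1000))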

Some notes on performance

  • Starting from Python 3.7, a plain dict is at least as fast as an OrderedDict for retaining order.
  • A set-based list comprehension can beat approaches that call a function per element, since it avoids function-call overhead.
  • The short-circuiting and/not trick shown earlier updates the set inline, deduplicating the list in a single pass.