
Compute list difference

python
list-comprehension
set-operations
performance-optimization
by Anton Shumikhin · Oct 17, 2024
TLDR

The quick answer for list difference is to use a list comprehension with not in to maintain the order:

diff = [x for x in list_a if x not in list_b] # What did x do to not get invited to the party, right?

Or if the order does not matter, set subtraction will get the job done:

diff = list(set(list_a) - set(list_b)) # Surprise removal test for lists.

Both yield the elements of list_a that are not in list_b. The set version also drops duplicates and returns the result in arbitrary order.
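To see how the two one-liners differ in practice, here is a small sketch. The sample lists are made up for illustration and are not part of the original answer.

```python
# Sample data chosen for illustration only.
list_a = [3, 1, 2, 3, 5, 1]
list_b = [1, 4]

# Order-preserving: keeps every element of list_a that is absent from list_b,
# including repeated values.
ordered_diff = [x for x in list_a if x not in list_b]

# Set-based: duplicates collapse and the resulting order is not guaranteed,
# so it is sorted here only to make the printout stable.
unordered_diff = list(set(list_a) - set(list_b))

print(ordered_diff)            # [3, 2, 3, 5]
print(sorted(unordered_diff))  # [2, 3, 5]
```

If the repeated 3 matters to you, the list comprehension is the safer choice.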

No-loss computation

When the list order matters, convert list_b to a set and execute a list comprehension:

set_b = set(list_b)
diff = [x for x in list_a if x not in set_b]  # Snapping away B elements from A, Thanos style.

This retains the order of elements from list_a whilst making the most out of faster set lookup.
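The same idea can be wrapped in a small helper. The function name `ordered_difference` is hypothetical, and the sketch assumes every element of list_b is hashable, since it has to go into a set.

```python
def ordered_difference(list_a, list_b):
    """Return the items of list_a that are not in list_b, in list_a's order."""
    # Build the set once; each membership test is then O(1) on average.
    set_b = set(list_b)
    return [x for x in list_a if x not in set_b]


print(ordered_difference(["d", "a", "c", "a", "b"], ["a", "z"]))  # ['d', 'c', 'b']
```

Building the set inside the function keeps callers from accidentally rebuilding it on every iteration of a loop.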

Retaining repeats

To keep duplicates, use a collections.Counter for complicated differences:

from collections import Counter
count_a = Counter(list_a)
count_b = Counter(list_b)
diff = list((count_a - count_b).elements())  # Subtracting like math class, but more fun

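This method subtracts frequencies, so the count of each remaining item is preserved. The result is grouped by element in order of first appearance in list_a, which is not necessarily the original position of each item.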
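A short sketch may make the ordering behaviour concrete. The sample data and the helper `multiset_difference` are hypothetical additions, and the grouping shown assumes Python 3.7 or later, where Counter remembers insertion order.

```python
from collections import Counter

list_a = ["b", "a", "b", "c", "a", "b"]
list_b = ["b", "a"]

# Counter subtraction drops one "b" and one "a"; counts that reach zero vanish.
grouped = list((Counter(list_a) - Counter(list_b)).elements())
print(grouped)  # ['b', 'b', 'a', 'c']


def multiset_difference(list_a, list_b):
    """Cancel one occurrence in list_a per occurrence in list_b, keeping positions."""
    to_remove = Counter(list_b)
    result = []
    for x in list_a:
        if to_remove[x] > 0:
            to_remove[x] -= 1  # this occurrence is cancelled by one in list_b
        else:
            result.append(x)
    return result


print(multiset_difference(list_a, list_b))  # ['b', 'c', 'a', 'b']
```

The one-liner is fine when only the counts matter; the loop variant is one possible way to keep the surviving items in their original positions.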

Advanced methods with difflib

In complex circumstances, where standard list operations are insufficient, use difflib.SequenceMatcher:

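from difflib import SequenceMatcher
sm = SequenceMatcher(None, list_a, list_b)
# get_opcodes() yields (tag, i1, i2, j1, j2); 'delete' and 'replace' mark spans of list_a missing from list_b
diff = [x for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag in ('delete', 'replace')
        for x in list_a[i1:i2]]  # Time-traveling to alter history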

difflib provides not just differences, but also contextual changes between lists, ideal for non-standard diff computations.
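As a rough illustration of that context, the opcodes can be split into what was removed from list_a and what was added in list_b. The sample lists and the variable names `removed` and `added` are assumptions made for this sketch.

```python
from difflib import SequenceMatcher

list_a = ["a", "b", "c", "d", "e"]
list_b = ["a", "c", "e", "f"]

sm = SequenceMatcher(None, list_a, list_b)

removed = []  # items present in list_a but dropped on the way to list_b
added = []    # items that appear only in list_b

# Each opcode is (tag, i1, i2, j1, j2): list_a[i1:i2] maps onto list_b[j1:j2].
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag in ("delete", "replace"):
        removed.extend(list_a[i1:i2])
    if tag in ("insert", "replace"):
        added.extend(list_b[j1:j2])

print(removed)  # ['b', 'd']
print(added)    # ['f']
```

Unlike the set and Counter approaches, this is position-aware: an item that merely moved can show up as both removed and added.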

Large scale performance

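Remember time complexity with big lists: a list comprehension that checks membership against a set is O(n + m) on average, but checking against list_b directly makes it O(n*m), where n and m are the lengths of list_a and list_b. The latter is inefficient for large datasets.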
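A quick, informal way to check this on your own machine is a timeit comparison. The list sizes below are arbitrary assumptions and the absolute timings will vary, so treat the numbers as indicative only.

```python
import timeit

# Arbitrary sizes, chosen only so the gap is visible without a long wait.
list_a = list(range(2000))
list_b = list(range(0, 2000, 2))


def diff_with_list():
    # Each "not in" scans list_b, so the total work grows like n * m.
    return [x for x in list_a if x not in list_b]


def diff_with_set():
    # One pass to build the set, then average O(1) lookups.
    set_b = set(list_b)
    return [x for x in list_a if x not in set_b]


t_list = timeit.timeit(diff_with_list, number=5)
t_set = timeit.timeit(diff_with_set, number=5)
print(f"list lookup: {t_list:.4f}s, set lookup: {t_set:.4f}s")
```

Both functions return the same result; only the lookup structure differs.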

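Try NumPy for large numeric data and vectorized operations. Note that np.setdiff1d returns the sorted unique values of array_a that are not in array_b, so neither order nor duplicates are preserved: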

import numpy as np
array_a = np.array(list_a)
array_b = np.array(list_b)
diff = np.setdiff1d(array_a, array_b)  # numpy: not just for scientists!