
How to test if a string contains one of the substrings in a list, in pandas?

python
pandas
regex
string-matching
by Alex Kataev · Feb 22, 2025
TLDR

For a quick substring check within a Pandas series, craft a regex pattern from your list, like ['substr1', 'substr2', ...], and pass it to str.contains:

```python
import pandas as pd

# DataFrame with a column to check
df = pd.DataFrame({'column': ['text1', 'text2', ...]})

# List of substrings to search for
substrings = ['substr1', 'substr2', ...]

# The "Sherlock Holmes" one-liner to detect substrings
df['matches'] = df['column'].str.contains('|'.join(substrings))
```

Crazy to think that Sherlock Holmes could solve cases in one line – quite elementary, my dear Watson! When your substrings have special characters, use re.escape to avoid regex smelling a rat:

```python
import re

# CSI team: Escaping special characters in substrings
escaped_substrings = [re.escape(substring) for substring in substrings]
regex_pattern = '|'.join(escaped_substrings)

# Sherlock Holmes strikes again with accurate matches
df['matches'] = df['column'].str.contains(regex_pattern)
```

Detecting substrings: The detective's guide

Matching substrings can feel like a detective mystery. Let's decipher it:

Discarding case sensitivity

Give your detective code a laid-back, Hawaiian attitude with the case parameter:

```python
# It's always sunny in Philadelphia, but case-insensitive in Python
df['matches'] = df['column'].str.contains(regex_pattern, case=False)
```

Interpreting missing values (the missing person's case)

When values go missing (NaN), use the na parameter to decide if they're innocent or guilty:

```python
# Missing values tend to run away. Use 'na' to put them back in the line-up
df['matches'] = df['column'].str.contains(regex_pattern, na=False)
```

Dealing with false positives (The Usual Suspects)

Short substrings like 'pet' could cause mistaken identities (false positives) by matching inside longer words such as 'carpet'. To clear their name, wrap the pattern in word boundaries:

```python
# Time to find Keyser Söze among the usual suspects
regex_pattern = r'\b(?:' + regex_pattern + r')\b'
df['matches'] = df['column'].str.contains(regex_pattern)
```

Pandas detective tricks: From rookie to pro

From the rookie's first day on the beat to the seasoned pro, Pandas presents tools for everyone:

Lambda: For the crafty detective

The crafty detective uses a lambda with apply for those tough-to-crack cases:

```python
# Lambda, Lambda, Lambda! Revenge of the Nerds' detective trick
df['matches'] = df['column'].apply(lambda x: any(sub in x for sub in substrings))
```

Binary storage: No grey areas

For a verdict beyond reasonable doubt, store your results as binary values:

```python
# 1 for guilty, 0 for innocent - welcome to the binary justice system
df['matches'] = df['column'].str.contains(regex_pattern).astype(int)
```

The re.compile hook: When regex strikes back

When regex patterns get twisted, re.compile comes to the rescue:

```python
# Pattern coming through! Make way for your compiled regex
pattern = re.compile(regex_pattern)
df['matches'] = df['column'].str.contains(pattern)
```
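One quirk worth knowing: when str.contains receives a pre-compiled pattern, it refuses the case and flags arguments, so bake any flags into the compile call itself. A minimal sketch (the sample column values and pattern here are made-up examples):

```python
import re
import pandas as pd

# Hypothetical evidence log
df = pd.DataFrame({'column': ['Dog house', 'catalog', 'fish']})

# Case-insensitivity goes into the compiled pattern via re.IGNORECASE,
# not into str.contains(case=...)
pattern = re.compile(r'\b(?:cat|dog)\b', flags=re.IGNORECASE)
df['matches'] = df['column'].str.contains(pattern)
```

Here 'catalog' stays innocent: the word boundary stops 'cat' from matching mid-word, while 'Dog house' is caught despite the capital D.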

The science of detection

In the world of data, we often find ourselves playing the detective. Luckily, with Python's Pandas library, we have a great forensic toolkit at our disposal:

  • .str.contains(): the fingerprinting kit, finding direct evidence of substrings.
  • '|' operator: the forensic combinator, identifying multiple clues at once.
  • re.escape(): the technical expert, ensuring we don't get tripped up by slippery characters.
  • apply with lambda: the advanced investigator, performing complicated forensic examinations.
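Putting the whole forensic toolkit together, here is a minimal end-to-end sketch (the column values and substrings are made-up examples; note 'c++' needs escaping):

```python
import re
import pandas as pd

# Hypothetical case file with mixed case and a missing value
df = pd.DataFrame({'column': ['Cat food', 'dog TOY', None, 'C++ guide']})
substrings = ['cat', 'dog', 'c++']

# Escape special characters, join with '|', match case-insensitively,
# and book missing values as non-matches
pattern = '|'.join(re.escape(s) for s in substrings)
df['matches'] = df['column'].str.contains(pattern, case=False, na=False)
```

The result flags the first, second, and fourth rows, while the missing value is marked False instead of propagating NaN.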

Crafting better queries

Boost your detective skills with these methods:

On-point search with exclusions

Sharpen your findings by excluding unwanted suspects:

```python
# "You are not under arrest" - Negative lookaheads to the rescue
# (?:...) keeps the group non-capturing, so pandas won't warn about match groups
regex_pattern = r'^(?!.*(?:unwanted1|unwanted2)).*'
df['matches'] = df['column'].str.contains(regex_pattern)
```

Scaling up with external resources

The third-party regex library offers extra features beyond the built-in re (such as fuzzy matching and richer Unicode support), and can be faster for some patterns:

```python
import regex

# The art of casting, perfected with regex.search()
df['matches'] = df['column'].apply(lambda x: bool(regex.search(regex_pattern, x)))
```

Extracting information

Beyond mere detection, str.extract helps harvest the matching substring:

```python
# Extract - not just for delicious honey!
df['extracted_substring'] = df['column'].str.extract(f'({regex_pattern})')
```