
How can I find all matches to a regular expression in Python?

python
regex
finditer
match-objects
by Nikita Barsukov · Dec 29, 2024
TLDR

Use re.findall() to retrieve every non-overlapping occurrence of a regex pattern in a string. It returns the matches as a list of strings:

import re
matches = re.findall(r"pattern", "search_string")

For instance, if we want words that kick off with an 'S':

matches = re.findall(r"\bS\w+", "The rain in Spain")

The output we get is: ['Spain']. A no-nonsense, effective approach.

Using finditer for better performance

When wrestling with large bodies of text, or when you need more detail about each match, re.finditer() is an efficient alternative. Instead of building the whole list up front, it returns an iterator that yields Match objects (re.Match) one at a time:

matches = re.finditer(r"\bS\w+", "The rain in Spain")
for match in matches:
    print(match.group())  # Prints 'Spain'
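To see why the iterator matters, here is a minimal sketch of stopping early. The sample sentence and the variable names are made up for illustration; itertools.islice simply takes the first two matches and leaves the rest unconsumed:

```python
import re
from itertools import islice

# Hypothetical sample text, chosen only for this illustration.
text = "Sam saw seven Swans in Spain and Sweden"

# finditer returns a lazy iterator: matches are produced only on request.
matches = re.finditer(r"\bS\w+", text)

# Consume just the first two matches.
first_two = [m.group() for m in islice(matches, 2)]
print(first_two)  # ['Sam', 'Swans']

# The iterator picks up where it left off.
third = next(matches).group()
print(third)  # 'Spain'
```

With re.findall() the whole string would be scanned and every match stored before you could look at the first one.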

Squeezing information from Match objects

Each Match object yielded by re.finditer() is a goldmine of details about that match. You can extract them through methods such as .group() (the matched text), .start() and .end() (the position in the string), and .groups() (a tuple of all captured groups). Here is .group() in action:

matches = [m.group() for m in re.finditer(r"(\bS\w+)", "The rain in Spain")]
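The other methods work the same way. The sketch below splits the pattern into two capturing groups purely for illustration (that split is an assumption, not part of the example above) and collects the text, groups, and position of each match:

```python
import re

text = "The rain in Spain"

# Two capturing groups, used here only to show what .groups() returns.
pattern = r"\b(S)(\w+)"

details = []
for m in re.finditer(pattern, text):
    details.append({
        "text": m.group(),      # whole match
        "first": m.group(1),    # first capturing group
        "groups": m.groups(),   # all capturing groups as a tuple
        "start": m.start(),     # index where the match begins
        "end": m.end(),         # index just past the end of the match
    })

print(details)
# [{'text': 'Spain', 'first': 'S', 'groups': ('S', 'pain'), 'start': 12, 'end': 17}]
```

Slicing the original string with text[m.start():m.end()] gives back the same text as m.group().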

Findall and capturing groups

If your regular expression contains capturing groups, re.findall() returns the groups rather than the whole match. With one group you get a list of strings; with several groups you get a list of tuples:

matches = re.findall(r"(\bS\w+)\s(\bs\w+)", "The rain in Spain stays mainly in the plain")

This yields one tuple per match, with one entry per group: [('Spain', 'stays')].
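If you need grouping for alternation but still want the full matched text back, a non-capturing group (?:...) is one way to do it. A small sketch comparing the two forms on the same sentence (the patterns are chosen just for this illustration):

```python
import re

text = "The rain in Spain stays mainly in the plain"

# Capturing group: findall returns only what the group captured.
only_group = re.findall(r"\b(r|pl)ain\b", text)
print(only_group)  # ['r', 'pl']

# Non-capturing group: findall returns the whole match again.
whole_match = re.findall(r"\b(?:r|pl)ain\b", text)
print(whole_match)  # ['rain', 'plain']
```

Alternatively, re.finditer() with m.group() always gives the whole match, whatever groups the pattern contains.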

Beware of regex's greedy nature

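Regex can get a bit too eager sometimes. Quantifiers such as * and + are greedy by default: they grab as much text as they can. Append ? to a quantifier (*?, +?) to make it non-greedy, so it matches as little as possible and spares you surprising results: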

# Greedy match
matches = re.findall(r"<.*>", "<tag>content</tag>")
# Spits: ['<tag>content</tag>']
# Non-greedy match
matches = re.findall(r"<.*?>", "<tag>content</tag>")
# Spits: ['<tag>', '</tag>']  Greediness cured!

Flags: the secret spices of regex

Flags such as re.IGNORECASE change how a pattern is matched without changing the pattern itself. Think of them as secret spices for flavorful results. Pass them as the last argument:

matches = re.findall(r"spain", "The rain in Spain", re.IGNORECASE) # Feeds you: ['Spain']
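Flags can also be combined with the | operator or written inline at the start of the pattern. A short sketch, using a made-up multi-line string and hypothetical variable names:

```python
import re

# Hypothetical sample: three lines, only the first two consist solely of the word.
text = "Spain\nspain\nSPAIN is sunny"

# Combine flags with |. re.MULTILINE makes ^ and $ match at every line boundary.
pattern = re.compile(r"^spain$", re.IGNORECASE | re.MULTILINE)
line_matches = pattern.findall(text)
print(line_matches)  # ['Spain', 'spain']

# The inline form (?i) is equivalent to passing re.IGNORECASE.
inline_matches = re.findall(r"(?i)spain", "The rain in Spain")
print(inline_matches)  # ['Spain']
```

Compiling the pattern once with re.compile() is handy when the same pattern and flags are reused across many strings.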

Dealing with the Unicode dragon

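Taming the dragon of Unicode matching is easier than it sounds. In Python 3, str patterns are Unicode-aware by default, so \w, \d, and \s already match non-ASCII characters and the re.UNICODE flag is redundant. It is kept for backward compatibility, and passing it explicitly does no harm: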

matches = re.findall(r"\w+", "café", re.UNICODE)
# Lets you enjoy: ['café']  A hot cup of 'café'!
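To see what Unicode awareness buys you, compare it with re.ASCII, which restricts \w to [a-zA-Z0-9_]. A minimal sketch (the é is written as an escape only so the example does not depend on how the file is encoded):

```python
import re

word = "caf\u00e9"  # 'café' with a precomposed é

# Default for str patterns in Python 3: \w matches Unicode word characters.
unicode_matches = re.findall(r"\w+", word)
print(unicode_matches)  # ['café']

# With re.ASCII, \w matches only ASCII letters, digits and underscore.
ascii_matches = re.findall(r"\w+", word, re.ASCII)
print(ascii_matches)  # ['caf']
```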