
Split Strings into Words With Multiple Word Boundary Delimiters

python
string-splitting
regex-patterns
string-processing
by Alex Kataev · Sep 6, 2024
TLDR

Python's re.split() function from the re module can split a string on several different delimiters at once. Build a regex character class by listing every delimiter character inside [ ] with nothing between them; a comma placed inside the brackets is itself treated as a delimiter, not as a separator. Here's the quick code tip:

import re
words = re.split(r'[ ,;]+', 'Split,this;string')

Key Concept: The r'[ ,;]+' pattern tells re.split() to split on any run of one or more spaces, commas, or semicolons.

Output:

['Split', 'this', 'string']
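To see what the trailing + contributes, here is a small sketch comparing the pattern with and without it. The sample string 'a, b;c' is an illustrative assumption, not part of the original tip.

```python
import re

text = 'a, b;c'

# Without '+', every single delimiter character is its own split point,
# so the adjacent ',' and ' ' leave an empty string between them.
single = re.split(r'[ ,;]', text)

# With '+', a whole run of delimiter characters counts as one split point.
runs = re.split(r'[ ,;]+', text)

print(single)  # ['a', '', 'b', 'c']
print(runs)    # ['a', 'b', 'c']
```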

An Enhanced Guide to String Splitting

Preserving Delimiters Upon Splitting

To retain delimiters while splitting the string for context or advanced processing, employ a capturing group () around the delimiter regex:

import re
parts = re.split(r'(;)', 'Keep; the; delimiters')  # Bet you didn't see that coming!

Now your list will have delimiters as independent elements.
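A quick sketch of what the capturing group returns for the string above. The second pattern, which trims whitespace around the semicolon, is an optional variation added here for illustration.

```python
import re

text = 'Keep; the; delimiters'

# The captured ';' is returned as its own list element.
plain = re.split(r'(;)', text)

# Whitespace outside the group is consumed but not kept,
# so the surrounding pieces come back without stray spaces.
trimmed = re.split(r'\s*(;)\s*', text)

print(plain)    # ['Keep', ';', ' the', ';', ' delimiters']
print(trimmed)  # ['Keep', ';', 'the', ';', 'delimiters']
```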

Dealing with Empty Strings, the Unfortunate Byproduct

Post-split, you might find empty strings in the result when the string starts or ends with a delimiter. These "empty feelings", as we call them, can be filtered out using either a list comprehension or filter():

words = [word for word in words if word]  # With list comprehension
# or using filter
words = list(filter(None, words))  # Goodbye emptiness!
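Here is a minimal end-to-end sketch of where those empty strings come from and how both filters remove them. The input string is made up for the demonstration.

```python
import re

# The string starts with ',' and ends with ';', so re.split() reports
# an empty piece before the first delimiter and after the last one.
raw = re.split(r'[ ,;]+', ',leading and trailing;')
print(raw)  # ['', 'leading', 'and', 'trailing', '']

# Both approaches drop every falsy element, i.e. the empty strings.
with_comprehension = [word for word in raw if word]
with_filter = list(filter(None, raw))

print(with_comprehension)  # ['leading', 'and', 'trailing']
```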

The Art of Using Advanced Delimiter Patterns

For complex delimiters, like a concoction of punctuation, spaces, or special characters, conjure a more elaborate regex:

words = re.split(r'[ ,\-!?]+', 'Handle, multiple-delimiters! Right?') # So many options, such versatility!

Here we're treating commas, spaces, hyphens, exclamation points, and question marks as delimiters for splitting.
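If the delimiter set is only known at runtime, one possible approach is to build the character class programmatically. The helper name split_on_any below is hypothetical, and the sketch assumes a non-empty string of single-character delimiters; re.escape() takes care of characters such as - or ] that would otherwise be special inside [ ].

```python
import re

def split_on_any(text, delimiters):
    """Split text on runs of any character in delimiters, dropping empty pieces.

    Hypothetical helper; assumes delimiters is a non-empty string
    of single-character delimiters.
    """
    pattern = '[' + re.escape(delimiters) + ']+'
    return [piece for piece in re.split(pattern, text) if piece]

words = split_on_any('Handle, multiple-delimiters! Right?', ' ,-!?')
print(words)  # ['Handle', 'multiple', 'delimiters', 'Right']
```

Without the filtering step, the trailing '?' would leave an empty string at the end of the list.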

Unicode Characters and Contractions: The Tricky Fellows

In Python 3, regex patterns applied to str are Unicode-aware by default, so \w and \W treat letters such as ä or ß as word characters. This means you can safely split strings containing those fancy non-ASCII characters:

words = re.split(r'\W+', u'Sträßle überschwänglich') # Yes, Python is in-tune with the multicultural world.
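To make the Unicode behaviour visible, this sketch contrasts the default matching with the re.ASCII flag, which restricts \w and \W to ASCII. The flag is shown only for comparison; the original example does not use it.

```python
import re

text = 'Sträßle überschwänglich'

# Default for str patterns in Python 3: ä, ß and ü count as word characters.
unicode_words = re.split(r'\W+', text)

# With re.ASCII the same letters are treated as non-word characters,
# so the words are cut apart at every umlaut and at ß.
ascii_words = re.split(r'\W+', text, flags=re.ASCII)

print(unicode_words)  # ['Sträßle', 'überschwänglich']
print(ascii_words)    # ['Str', 'le', 'berschw', 'nglich']
```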

To extract words while protecting contractions like "don't", match the words themselves with re.findall() and a pattern like r"[\w']+" instead of splitting on delimiters:

words = re.findall(r"[\w']+", "Don't forget contractions") # Because contractions matter too!
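The difference is easiest to see side by side. This sketch runs a plain \W+ split and the findall() pattern over the same sentence.

```python
import re

text = "Don't forget contractions"

# Splitting on non-word characters treats the apostrophe as a delimiter.
split_words = re.split(r'\W+', text)

# Matching runs of word characters or apostrophes keeps the contraction whole.
found_words = re.findall(r"[\w']+", text)

print(split_words)  # ['Don', 't', 'forget', 'contractions']
print(found_words)  # ["Don't", 'forget', 'contractions']
```

One trade-off to keep in mind: the same pattern also keeps apostrophes used as quotation marks, so "'quoted'" comes back with its quotes attached.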

Efficiency: The Art of Python Zen

Sometimes, using regular expressions may feel like bringing a bazooka to a knife fight. In these cases, built-in string methods should serve you well:

# For simple space-based splitting
words = 'This is a sentence.'.split()  # Pssst! Simple is sometimes better.
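A short sketch of how str.split() behaves with and without an explicit separator. The sample strings are illustrative assumptions.

```python
# With no argument, split() collapses runs of whitespace and ignores
# leading/trailing whitespace. Punctuation stays attached to the words.
padded = '  This  is a sentence. '
words = padded.split()
print(words)  # ['This', 'is', 'a', 'sentence.']

# With an explicit separator, every occurrence is a split point,
# so consecutive separators produce empty strings.
fields = 'a,b,,c'.split(',')
print(fields)  # ['a', 'b', '', 'c']
```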

Use str.translate() with str.maketrans() before str.split() to strip unwanted punctuation:

import string
cleaned = 'Hello, World!'.translate(str.maketrans('', '', string.punctuation))
words = cleaned.split()  # A quick cleanup, and voila!
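Deleting punctuation outright can fuse words that were joined only by a punctuation mark. As a possible variation on the translate() call above, this sketch maps each punctuation character to a space instead; the table names are made up for the example.

```python
import string

text = 'Handle, multiple-delimiters! Right?'

# Third argument of maketrans(): characters to delete.
delete_table = str.maketrans('', '', string.punctuation)

# Two equal-length strings: map each punctuation character to a space.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

deleted = text.translate(delete_table).split()
spaced = text.translate(space_table).split()

print(deleted)  # ['Handle', 'multipledelimiters', 'Right']
print(spaced)   # ['Handle', 'multiple', 'delimiters', 'Right']
```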

String Splitting Master Class

Fixed Number of Splits: The Power of Prudence

Pass the maxsplit parameter to limit the number of splits when you only want to separate the first few pieces of a string:

words = re.split(r'\W+', 'Split only, this. Sentence', maxsplit=1) # Because you get to choose where to draw the line!
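The sketch below shows how the result changes as maxsplit grows. Passing maxsplit by keyword, as the original does, avoids confusing it with the flags argument that follows it in the re.split() signature.

```python
import re

text = 'Split only, this. Sentence'

one_split = re.split(r'\W+', text, maxsplit=1)
two_splits = re.split(r'\W+', text, maxsplit=2)
no_limit = re.split(r'\W+', text)  # maxsplit defaults to 0, meaning no limit

print(one_split)   # ['Split', 'only, this. Sentence']
print(two_splits)  # ['Split', 'only', 'this. Sentence']
print(no_limit)    # ['Split', 'only', 'this', 'Sentence']
```

The last element always holds the untouched remainder of the string, delimiters included.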

Working With Punctuation: Be Picky

Taking advantage of the string.punctuation constant, you can reference all punctuation characters without painstakingly typing them out:

from string import punctuation
filtered_str = ''.join(ch for ch in 'Remove, punctuation!' if ch not in punctuation)  # We love shortcuts, don't we?
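The same constant can also feed a regex character class, which ties it back to splitting. This is a sketch of one way to do it, not the only one; the variable names are illustrative, and note that string.punctuation includes the underscore, so identifiers like snake_case get split too.

```python
import re
from string import punctuation

# re.escape() neutralises characters such as ']' and '\' inside the class;
# \s adds whitespace to the set of delimiters.
pattern = '[' + re.escape(punctuation) + r'\s]+'

# Drop the empty piece left by the trailing '!'.
words = [word for word in re.split(pattern, 'Remove, punctuation!') if word]
print(words)  # ['Remove', 'punctuation']
```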

Efficiency Tip: Precompile Your Regex Patterns

If you're handling the same pattern frequently, precompiling your regex pattern is a wise choice:

pattern = re.compile(r'[ ,\-!?]+')
words = pattern.split('Use, this-pattern!')  # Efficiency is key, remember?
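A sketch of how a precompiled pattern might be reused across many strings. The tokenize function and the DELIMITERS name are hypothetical. Because the re module already caches recently used patterns internally, the gain is usually modest; the clearer benefit is naming the pattern once and reusing it.

```python
import re

# Compiled once at import time and shared by every call below.
DELIMITERS = re.compile(r'[ ,\-!?]+')

def tokenize(lines):
    """Split each line on the shared pattern, dropping empty pieces."""
    return [[word for word in DELIMITERS.split(line) if word] for line in lines]

tokens = tokenize(['Use, this-pattern!', 'Then reuse-it, again?'])
print(tokens)  # [['Use', 'this', 'pattern'], ['Then', 'reuse', 'it', 'again']]
```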

References