Split Strings into Words With Multiple Word Boundary Delimiters
Python's re.split() function from the re module can split a string on several different delimiters at once. Construct a regex pattern that lists all required delimiters inside a character class, [ ]. The characters are written back to back with no separator; a comma inside the brackets is itself treated as a delimiter.
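As a minimal sketch of this tip (the sample string and variable names are assumptions chosen for illustration):

```python
import re

# Hypothetical sample text mixing commas, semicolons, and spaces.
text = "Split, this; string"

# Split on any run of spaces, commas, or semicolons.
words = re.split(r"[ ,;]+", text)
print(words)
```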
Key Concept: The r'[ ,;]+' pattern tells re.split() to split on any run of one or more spaces, commas, or semicolons.
Output:
['Split', 'this', 'string']
An Enhanced Guide to String Splitting
Preserving Delimiters Upon Splitting
To retain the delimiters while splitting, for context or advanced processing, wrap the delimiter pattern in a capturing group, ( ). The resulting list then contains the delimiters as independent elements between the words.
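A small sketch of the capturing-group behavior, reusing an assumed sample string:

```python
import re

# Hypothetical sample text.
text = "Split, this; string"

# The parentheses make re.split() keep each matched delimiter run.
parts = re.split(r"([ ,;]+)", text)
print(parts)  # ['Split', ', ', 'this', '; ', 'string']
```

Joining the pieces with "".join(parts) reproduces the original string, which is handy when the text has to be reassembled after processing.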
Dealing with Empty Strings, the Unfortunate Byproduct
Post-split, you might find empty strings in the result if your pattern matches at the very start or end of the string. These "empty feelings", as we call them, can be filtered out using either a list comprehension or filter().
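The sketch below shows both filtering options on an assumed sample string that starts and ends with a delimiter:

```python
import re

# Hypothetical input with a leading comma and a trailing semicolon.
text = ",alpha, beta;gamma;"

raw = re.split(r"[ ,;]+", text)
print(raw)  # ['', 'alpha', 'beta', 'gamma', '']

# Option 1: a list comprehension keeps only non-empty pieces.
cleaned = [piece for piece in raw if piece]

# Option 2: filter(None, ...) drops every falsy item, including ''.
cleaned_too = list(filter(None, raw))

print(cleaned)  # ['alpha', 'beta', 'gamma']
```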
The Art of Using Advanced Delimiter Patterns
For complex delimiters, like a concoction of punctuation, spaces, or special characters, conjure a more elaborate regex. A single character class can, for instance, treat commas, spaces, hyphens, exclamation points, and question marks all as delimiters for splitting.
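One way to write such a pattern is sketched here; the exact regex and the sample sentence are assumptions, not necessarily what the original snippet used:

```python
import re

# Hypothetical sample sentence.
text = "Hello, world! How-are you?"

# Inside a character class the hyphen is escaped so it is not read as a range.
pattern = r"[, \-!?]+"

# The trailing '?' would leave an empty string at the end, so filter it out.
words = [w for w in re.split(pattern, text) if w]
print(words)  # ['Hello', 'world', 'How', 'are', 'you']
```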
Unicode Characters and Contractions: The Tricky Fellows
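Python's regex functions understand Unicode. In Python 3, patterns written as str are Unicode-aware by default, so classes such as \w and \W recognize non-ASCII letters. This means you can safely split strings containing those fancy non-ASCII characters.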
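A brief sketch with an assumed sample containing Cyrillic and German text:

```python
import re

# Hypothetical sample with non-ASCII letters.
text = "Привет, мир! Straße"

# \W+ matches runs of non-word characters; Unicode letters count as word characters.
words = re.split(r"\W+", text)
print(words)  # ['Привет', 'мир', 'Straße']
```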
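To split strings while protecting contractions like "don't", match the words themselves instead of the delimiters: use re.findall() with a pattern like r"[\w']+", which treats the apostrophe as part of a word.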
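A short sketch of the contraction-friendly approach, assuming straight apostrophes in the input:

```python
import re

# Hypothetical sample sentence with two contractions.
text = "Don't split contractions, they're words too!"

# findall() returns every run of word characters and apostrophes.
words = re.findall(r"[\w']+", text)
print(words)  # ["Don't", 'split', 'contractions', "they're", 'words', 'too']
```

Note that typographic apostrophes (’) are a different character and would need to be added to the character class.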
Efficiency: The Art of Python Zen
Sometimes, using regular expressions may feel like bringing a bazooka to a knife fight. In these cases, built-in string methods should serve you well. Use str.replace() before str.split() to remove unnecessary punctuation or turn it into spaces, then let a plain whitespace split do the rest.
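A regex-free sketch of this idea, using an assumed sample string with just two kinds of punctuation:

```python
# Hypothetical sample text.
text = "apples, oranges; bananas"

# Turn each unwanted delimiter into a space, then split on whitespace.
# split() with no arguments also collapses runs of spaces.
words = text.replace(",", " ").replace(";", " ").split()
print(words)  # ['apples', 'oranges', 'bananas']
```

This stays readable for two or three delimiters; with many more, the chain of replace() calls grows and a regex is usually the tidier choice.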
String Splitting Master Class
Fixed Number of Splits: The Power of Prudence
Pass the maxsplit parameter to limit the number of splits when you only want to separate the first few pieces of a string. The remainder comes back unsplit as the last element.
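A minimal sketch, with an assumed sample string and maxsplit passed as a keyword argument:

```python
import re

# Hypothetical sample text.
text = "one, two; three four"

# Stop after two splits; the rest of the string is kept intact.
parts = re.split(r"[ ,;]+", text, maxsplit=2)
print(parts)  # ['one', 'two', 'three four']
```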
Working With Punctuation: Be Picky
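Taking advantage of the string.punctuation constant, you can reference all ASCII punctuation characters without painstakingly typing them out. Run it through re.escape() before placing it inside a character class, because it contains characters such as ], ^, \ and - that have special meaning there.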
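The following sketch builds such a pattern; the sample sentence and the choice to also split on whitespace are assumptions for illustration:

```python
import re
import string

# Hypothetical sample sentence.
text = "Hello, world! (Testing) punctuation."

# Escape the punctuation so it is safe inside [ ], and add \s for whitespace.
pattern = "[" + re.escape(string.punctuation) + r"\s]+"

# Drop the empty string produced by the trailing period.
words = [w for w in re.split(pattern, text) if w]
print(words)  # ['Hello', 'world', 'Testing', 'punctuation']
```

Keep in mind that string.punctuation includes the apostrophe and the underscore, so contractions and snake_case names will be split apart as well.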
Efficiency Tip: Precompile Your Regex Patterns
If you're handling the same pattern frequently, precompiling it with re.compile() is a wise choice: the pattern is parsed once, and the resulting object exposes the same split() method.
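A small sketch of reusing a compiled pattern across several lines (the names DELIMITERS, lines, and rows are illustrative assumptions):

```python
import re

# Compile once, reuse many times.
DELIMITERS = re.compile(r"[ ,;]+")

# Hypothetical batch of input lines.
lines = ["a, b; c", "d;e f"]

rows = [DELIMITERS.split(line) for line in lines]
print(rows)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

The re module also caches recently used patterns internally, so the gain is often modest; a named, compiled pattern mainly pays off in tight loops and in readability.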