Explain Codes LogoExplain Codes Logo

Substitute multiple whitespace with single whitespace in Python

python
functions
performance
best-practices
Alex KataevbyAlex Kataev·Oct 7, 2024
TLDR

You can collapse multiple whitespaces to one in Python using the re.sub() function from the re module. The pattern r'\s+' targets clusters of whitespace, which it then replaces with a single space. See the code snippet below for a real quick check:

import re # Imagine Bob Ross painting a happy little whitespace here clean_text = re.sub(r'\s+', ' ', "This is an example. Got it?") print(clean_text) # See? "This is an example. Got it?"

This code takes away the extra spaces, making the string look neat and well-formatted.

Alternatives to regular expressions

Divide and conquer (Splitting and joining)

If regex gives you a headache, Python's split() and join() operations make a solid team. Together they can defeat the whitespace monster just as well. This combined function works by splitting the string into words, then assembling them back together with a space.

# No regex were harmed in creating this example. text = "Too many spaces." clean_text = ' '.join(text.split()) print(clean_text) # Now it says: "Too many spaces."

Handling the outskirt whiteness (Strip operation)

If you want to get rid of whitespace around the string, you can call the strip() method. This doesn't affect the spaces between the words.

text = " Middle. " clean_text = text.strip() print(clean_text) # And the result is: "Middle."

Valar Morghulis (all leading and trailing whitespace must die)!

Dealing with different types of whitespace

Beware of the Control Characters

Whitespace is like an iceberg—there’s a lot more below the surface (hello \t, \n, etc.). So when using re.sub(), remember it views these control characters as whitespaces as well which could generate unexpected outputs.

Focusing on ASCII whitespace

The re.ASCII flag comes in handy when you need to handle only ASCII spaces and leave other whitespace ghosts (like unicode ones) untouched. If your text is a melting pot of encoding standards, this could be helpful.

# Now we're flagging! It's ASCII-nating. clean_text = re.sub(r'\s+', ' ', "Use the space, Luke.", flags=re.ASCII) print(clean_text) # Outputs: "Use the space, Luke."

Considering performance

Comparing methods

Regex isn't the only game in town. Non-regex methods like Alex Martelli's split/join method often out-paces the competition. Make different methods race against each other for your specific scenario. After all, performance is a marathon, not a sprint! For a detailed comparison, you can refer this link.

Internal optimisations

Python keeps two power-ups in the form of _RE_COMBINE_WHITESPACE and _RE_STRIP_WHITESPACE. They come from the re module and are highly optimized for whitespace operations. While they're not officially part of the API, you can still use them for an extra boost.

from re import _RE_COMBINE_WHITESPACE, _RE_STRIP_WHITESPACE # There's always an "_under_score" (_not really) text = "Whitespace everywhere." clean_text = _RE_COMBINE_WHITESPACE.sub(' ', text).strip() print(clean_text) # Outputs: "Whitespace everywhere."

Useful tips, tricks, and caveats

Escaping literal spaces

Remember, regex treats a lot of symbols as special characters. If you want to include an actual space in your pattern, don't space out! Use a backslash (\ ) to escape it.

Immutable strings and memory

Keep in mind that strings in Python are immutable. This means any string operation creates a new string, which could lead to high memory usage. It's a game of memory mahjong, choose your moves wisely!

Comprehensive string cleaning

If you're in a string cleaning spree, a more complex regex pattern or a chain of string methods could help you capture and clean a wider range of unwanted characters.