
Removing all non-numeric characters from string in Python

python
regex
string-manipulation
performance
by Anton Shumikhin · Mar 4, 2025
TLDR

Quickly filter out non-numeric characters using re.sub() within the re module:

import re

cleaned = re.sub(r'\D', '', 'example123')

This leaves '123' in cleaned: \D matches every non-digit character, and re.sub() replaces each match with an empty string, so only the digits remain.

Breaking down the regex for non-digit removal

Regular expressions are a battle-tested tool for string handling, and the syntax needed here is small. The pattern r'\D' matches any single character that is not a digit, so re.sub() with an empty replacement deletes every such character. To keep decimal points, use the negated character class r'[^\d.]', which removes everything except digits and dots. Note that it keeps every dot, not just the one inside a number, so stray periods in the input survive too.
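As a small sketch of how the two patterns differ, the snippet below runs both over a made-up sample string (the strings and variable names are illustrative assumptions, not from the original). It also shows that in Python 3 \d on str patterns matches non-ASCII decimal digits unless re.ASCII is passed:

```python
import re

text = "Total: 1,234.50 USD."  # illustrative input

# \D removes everything that is not a digit, including the decimal point.
digits_only = re.sub(r"\D", "", text)

# [^\d.] keeps digits and every literal dot, including the final period.
digits_and_dots = re.sub(r"[^\d.]", "", text)

# \u0663 is ARABIC-INDIC DIGIT THREE; by default \d treats it as a digit.
mixed = "room \u0663 floor 4"
unicode_digits = re.sub(r"\D", "", mixed)
# re.ASCII restricts \d (and therefore \D) to the characters 0-9.
ascii_digits = re.sub(r"\D", "", mixed, flags=re.ASCII)

print(digits_only)      # 123450
print(digits_and_dots)  # 1234.50.
print(ascii_digits)     # 4
```

If the result has to be passed to int() or float(), the trailing dot in the second output is a reminder that stripping alone does not guarantee a well-formed number.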

Clarity with filtering

For a regex-free and arguably more readable alternative, combine filter() with str.isdigit(). filter() keeps only the characters for which str.isdigit() returns True, and ''.join() reassembles them into a string. The same line runs on both Python 2 and Python 3:

numeric_string = ''.join(filter(str.isdigit, 'hello4me123')) # because letters don't count🙃
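One caveat worth checking: str.isdigit() is broader than the characters 0-9 and also accepts characters such as superscript digits. The sketch below, using a made-up sample string, compares it with str.isdecimal() and with an explicit ASCII whitelist:

```python
sample = "x\u00b2 = 49"  # \u00b2 is SUPERSCRIPT TWO

# isdigit() is True for the superscript, so it slips through the filter.
with_isdigit = "".join(filter(str.isdigit, sample))

# isdecimal() rejects the superscript but still accepts non-ASCII decimal digits.
with_isdecimal = "".join(filter(str.isdecimal, sample))

# An explicit whitelist keeps only the ten ASCII digits.
ascii_only = "".join(ch for ch in sample if ch in "0123456789")

print(len(with_isdigit))  # 3 (superscript two, then 4 and 9)
print(with_isdecimal)     # 49
print(ascii_only)         # 49
```

Which variant is appropriate depends on whether the input can contain non-ASCII text; for plain ASCII input all three behave the same.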

Dealing with floats and negative numbers

Floats and negative numbers need a slightly broader pattern. To preserve the decimal separator (.) and the minus sign (-), add both to the negated character class:

floats = re.sub(r'[^\d.-]', '', 'example-123.45') # making sure the decimal number doesn't feel left out 😉

Beware: this pattern keeps every '-' and '.', wherever they appear, so the result is not guaranteed to be a valid number. For example, 'a-1.2-3.4' becomes '-1.2-3.4'. Validate the result or refine the pattern if the input may contain stray separators.
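One possible refinement is to validate the stripped string before converting it. The helper below is a hypothetical sketch (to_float is not part of any library) that assumes the input should contain exactly one signed decimal number:

```python
import re

def to_float(raw):
    """Hypothetical helper: strip unwanted characters, then validate the rest."""
    candidate = re.sub(r"[^\d.-]", "", raw)
    # Accept an optional minus sign, digits, and an optional fractional part.
    if re.fullmatch(r"-?\d+(?:\.\d+)?", candidate):
        return float(candidate)
    return None  # the leftovers do not form a single well-formed number

print(to_float("example-123.45"))  # -123.45
print(to_float("v1.2-beta.3"))     # None (stripping leaves '1.2-.3')
```

Returning None is only one design choice; raising ValueError may suit code that treats malformed input as an error.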

Extracting multiple numbers with regex

The re module also offers tools for pulling out several numbers at once. re.finditer() returns an iterator of match objects, one per number found, which is useful when a string contains more than one number:

numbers = [match.group() for match in re.finditer(r'-?\d+\.?\d*', 'level 123 and -456.78')] # good thing we are not using roman numbers here 😅
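Note that the pattern above allows a dot with no digits after it, so a number at the end of a sentence is captured together with the period. A slightly stricter variant, sketched below on a made-up sentence, requires digits after the dot and uses re.findall() for brevity:

```python
import re

text = "level 123 and -456.78. Next: 9."  # illustrative input

# Original pattern: the optional dot can match a sentence-ending period.
loose = re.findall(r"-?\d+\.?\d*", text)

# Stricter variant: a dot only counts when digits follow it.
strict = re.findall(r"-?\d+(?:\.\d+)?", text)

values = [float(s) for s in strict]

print(loose)   # ['123', '-456.78', '9.']
print(strict)  # ['123', '-456.78', '9']
print(values)  # [123.0, -456.78, 9.0]
```

Both patterns treat any hyphen directly before a digit as a minus sign, so a range like '2020-2021' yields '2020' and '-2021'. Inputs of that kind need extra handling.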

Upping your filtering game

For a more performance-oriented approach, store the allowed characters in a frozenset(). Set membership tests take constant time on average, which may help when filtering large volumes of text:

from string import digits

allowed_chars = frozenset(digits)
cleaned = ''.join(filter(allowed_chars.__contains__, 'example 123'))  # "allowed_chars" sounds like a bouncer at a digits-only club 🕺
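Whether this is actually faster depends on the input and the Python version, so it is worth measuring. The following is a rough benchmarking sketch using timeit from the standard library; the sample text, function names, and repeat count are arbitrary assumptions:

```python
import re
import timeit
from string import digits

text = "order 66, item 1234, zip 90210 " * 1000  # synthetic sample data
allowed_chars = frozenset(digits)

def with_regex():
    return re.sub(r"\D", "", text)

def with_isdigit():
    return "".join(filter(str.isdigit, text))

def with_frozenset():
    return "".join(filter(allowed_chars.__contains__, text))

# All three approaches should agree on ASCII input before we compare speed.
assert with_regex() == with_isdigit() == with_frozenset()

for func in (with_regex, with_isdigit, with_frozenset):
    seconds = timeit.timeit(func, number=20)
    print(f"{func.__name__}: {seconds:.4f}s")
```

Timings vary between machines and interpreter versions, so treat the numbers as a guide for your own data rather than a general ranking.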

Unleashing Python string constants

Python's string module provides a set of ready-made string constants. For example, string.digits is the string '0123456789', which makes membership checks for ASCII digits straightforward:

from string import digits

cleaned = ''.join(character for character in 'example123' if character in digits)  # it's like inviting only numbers to a special party 🎉

Extend the same idea to other character sets by swapping in constants such as string.ascii_letters or string.hexdigits, depending on what you need to keep.
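As an illustration of swapping constants, here is a minimal sketch built around a hypothetical keep_only() helper (the name and signature are assumptions made for this example):

```python
from string import ascii_letters, digits, hexdigits

def keep_only(text, allowed):
    """Hypothetical helper: keep only the characters found in `allowed`."""
    allowed_set = frozenset(allowed)
    return "".join(ch for ch in text if ch in allowed_set)

print(keep_only("#1A2b3C", hexdigits))          # 1A2b3C
print(keep_only("hello4me123", ascii_letters))  # hellome
print(keep_only("hello4me123", digits))         # 4123
```

Keep in mind that string.hexdigits includes the letters a-f and A-F, so ordinary words containing those letters will partially survive the filter.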

Non-conventional numbers and their handling

You might also encounter non-standard numeric formats such as Roman numerals, currency amounts, or scientific notation. The simple patterns above will mangle these, so they call for custom regex patterns or dedicated parsing logic that recognizes each format and keeps its value intact.
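As a rough sketch of what such custom handling might look like, the snippet below covers two of these cases. The SCI_NUMBER pattern and the sample strings are assumptions for illustration, and the price example assumes commas as thousands separators and a dot as the decimal separator:

```python
import re

# Optional sign, digits, optional fraction, optional exponent such as e24 or E-19.
SCI_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")

measurements = "mass 5.97e24 kg, charge -1.6E-19 C"
sci_values = [float(s) for s in SCI_NUMBER.findall(measurements)]

# Currency: drop the symbol and thousands separators, keep the decimal point.
price_text = "$1,299.99"
price = float(re.sub(r"[^\d.]", "", price_text))

print(sci_values)  # [5.97e+24, -1.6e-19]
print(price)       # 1299.99
```

Locales that write 1.299,99 would need the separators handled the other way around, and Roman numerals need dedicated parsing logic that a character filter cannot provide.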