Count number of occurrences of a substring in a string

python
string-manipulation
regular-expressions
performance-optimization
by Alex Kataev · Oct 20, 2024
TLDR

Use str.count(sub) to quickly count how often a substring occurs in a string:

txt = "hello world, hello universe"
sub = "hello"
print(txt.count(sub))  # Count: 2. The universe seems to be unusually friendly today!

This method returns the count of non-overlapping "hello" occurrences within the txt string.
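To make "non-overlapping" concrete, here is a small sketch using a string whose occurrences could overlap. The sample text "aaaa" is an assumption chosen for illustration; the optional start argument is part of str.count() itself.

```python
text = "aaaa"

# Matches are taken at indexes 0 and 2; the one at index 1 overlaps
# the first match, so str.count() skips it.
print(text.count("aa"))     # 2

# The optional start argument limits the search to text[1:], i.e. "aaa".
print(text.count("aa", 1))  # 1
```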

Battle with overlapping substrings

If you need to count overlapping occurrences, str.count() comes up short. In this fight, a regular expression with a lookahead assertion is your key weapon:

import re

txt = "hellohello"
sub = "(?=hello)"  # Like looking for Waldo in a crowd of Waldos! Let's do this!
print(len(re.findall(sub, txt)))  # Count: 2

The lookahead matches at every position where "hello" starts without consuming any characters, so an occurrence that begins inside the previous one is still counted.
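Since "hellohello" contains no truly overlapping matches, a string like "aaaa" shows the difference more clearly. The sketch below wraps the lookahead in a hypothetical helper named count_overlapping (not from the original article) and assumes the substring should be matched literally, so it is passed through re.escape() first.

```python
import re

def count_overlapping(text, sub):
    """Hypothetical helper: count occurrences of sub in text, overlaps included."""
    # re.escape() keeps characters such as "." or "*" from acting as regex syntax.
    pattern = "(?=" + re.escape(sub) + ")"
    return len(re.findall(pattern, text))

print("aaaa".count("aa"))                 # 2 (non-overlapping)
print(count_overlapping("aaaa", "aa"))    # 3 (matches start at indexes 0, 1, 2)
print(count_overlapping("a.b.a.b", "."))  # 3 (literal dots only)
```

Without re.escape(), the last call would treat "." as "any character" and report a match at every position.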

The old-school way: manual counting

If pre-built methods aren't your style, you can always go the hand-crafted artisan way with manual traversal:

def count_substrings(string, substring):
    count = start = 0
    while True:  # Classic Find & Seek game
        # find() returns -1 when nothing is left, which turns start into 0
        start = string.find(substring, start) + 1
        if start > 0:
            count += 1
        else:
            return count

txt = "hellohello"
sub = "hello"
print(count_substrings(txt, sub))  # Count: 2; or should we say, "Count: too"?

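In this function, each find() call resumes one character after the start of the previous match, and the counter goes up on every hit. Because the search advances by a single character rather than by the length of the substring, overlapping occurrences are counted as well.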
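How far the search advances after each hit decides whether overlaps are counted. The following is a minimal sketch of that idea; the helper name count_occurrences and its overlapping flag are assumptions made for illustration, not part of the original article.

```python
def count_occurrences(text, sub, overlapping=True):
    """Hypothetical helper: count sub in text with a str.find() loop."""
    if not sub:
        raise ValueError("sub must not be empty")
    # Step 1 allows overlaps; stepping by len(sub) mimics str.count().
    step = 1 if overlapping else len(sub)
    count = 0
    pos = text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + step)
    return count

print(count_occurrences("aaaa", "aa"))                     # 3
print(count_occurrences("aaaa", "aa", overlapping=False))  # 2
```

Rejecting an empty substring up front is a design choice here; it sidesteps the question of what counting "nothing" should mean.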

Heads up: case sensitivity and normalization

Remember that str.count() is case-sensitive. For a case-insensitive tally, convert both strings to the same case first:

txt.lower().count(sub.lower()) # Consistency is key, even in letter casing!

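Also, normalize Unicode text to a single form (for example NFC) before counting, since the same visible character can be encoded as different sequences of code points.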
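As a rough illustration of why normalization matters, the sketch below counts "café" written in two different encodings. The helper name normalized_count is hypothetical, and the choice of NFC plus casefold() is an assumption; other normalization forms may suit other data better.

```python
import unicodedata

def normalized_count(text, sub):
    """Hypothetical helper: case-insensitive count after Unicode normalization."""
    def norm(s):
        # NFC merges "e" + combining accent into one code point;
        # casefold() is a more aggressive lower() meant for comparisons.
        return unicodedata.normalize("NFC", s).casefold()
    return norm(text).count(norm(sub))

composed = "caf\u00e9 CAF\u00c9"  # accented letters as single code points
decomposed = "cafe\u0301"         # "e" followed by a combining acute accent

print(composed.count(decomposed))             # 0
print(normalized_count(composed, decomposed)) # 2
```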

Performance tales and wails

Manual counting and regex are neat, but both usually run slower than str.count(), which CPython implements in C. On long strings the difference can matter, so prefer str.count() whenever overlapping matches are not needed.
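One way to check this on your own machine is a quick timeit comparison. The sketch below is only a rough measurement under assumed inputs (a repeated sample string and 20 runs); actual timings depend on the Python version, the hardware, and the data, so no numbers are shown.

```python
import re
import timeit

text = "hello world, " * 10_000  # assumed sample input
sub = "hello"
pattern = re.compile(re.escape(sub))

def with_count():
    return text.count(sub)

def with_regex():
    return len(pattern.findall(text))

# Both approaches must agree before their speed is compared.
assert with_count() == with_regex() == 10_000

for fn in (with_count, with_regex):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```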

Edge of the edge cases

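Stay vigilant with edge cases such as empty strings or empty substrings, which can produce surprising results. An empty substring matches at every position, including the end of the string, so the count is len(txt) + 1: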

txt = "hello"
sub = ""
print(txt.count(sub))  # To your surprise, it's 6; because in coding, even nothing is something!
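If an empty substring should not count at all, a small guard handles it. This is a sketch of one possible policy; the helper name safe_count and the decision to return 0 are assumptions, and raising an error may fit other programs better.

```python
def safe_count(text, sub):
    """Hypothetical helper: like str.count(), but an empty sub counts as 0."""
    if not sub:
        return 0
    return text.count(sub)

print("hello".count(""))        # 6, which is len("hello") + 1
print(safe_count("hello", ""))  # 0
print(safe_count("", "hello"))  # 0 (nothing to find in an empty string)
```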