
Find the similarity metric between two strings

python · string-similarity · python-stdlib · fuzzy-matching
by Alex Kataev · Sep 24, 2024
TLDR

Estimate the similarity between two strings with the Levenshtein ratio. Using the python-Levenshtein module, you get a score from 0.0 (no similarity) to 1.0 (identical strings). Here's a sample usage:

```python
from Levenshtein import ratio

similarity_score = ratio("kitten", "sitting")
print(similarity_score)  # Prints: 0.6153846153846154, quite similar given they're different pets!
```

This way, you get an immediate, quantifiable metric for string similarity.

String similarity with Python: A deep dive

When looking at the similarity between two strings, remember that it's not just about finding exact matches. There are numerous methods and techniques for quantifying how closely two strings resemble each other.

Python standard library to the rescue!

You don't always need external modules for string comparison. The standard-library difflib module's SequenceMatcher class is a quick and effective out-of-the-box solution.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar("Apple", "Appel"))  # High similarity: an apple by another spelling is still tasty
print(similar("Apple", "Mango"))  # Lower similarity: confirms the fruits are indeed different
```

Advanced metrics with Jaro-Winkler and Jellyfish

Python's jellyfish library supports robust measures including Jaro, Jaro-Winkler, and Levenshtein distances. These come in handy when you need a more comprehensive comparison.

```python
import jellyfish

# In jellyfish 1.x this is jaro_similarity; older releases called it jaro_distance
jaro_score = jellyfish.jaro_similarity("Apple", "Appel")
levenshtein_score = jellyfish.levenshtein_distance("Apple", "Appel")
print(jaro_score, levenshtein_score)  # Who knew there was so much to say about apples!
```
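
Jellyfish also covers the Jaro-Winkler measure from this section's heading. A minimal sketch, assuming a recent jellyfish release where the function is named jaro_winkler_similarity (older versions expose it as jaro_winkler):

```python
import jellyfish

# Jaro-Winkler boosts the score for strings that share a common prefix
jw_score = jellyfish.jaro_winkler_similarity("Apple", "Appel")
print(jw_score)  # Higher than plain Jaro thanks to the shared "App" prefix
```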

Taking things up a notch with "TheFuzz"

Formerly known as FuzzyWuzzy, TheFuzz is a resourceful library for efficient similarity calculations, with functions like fuzz.ratio and fuzz.token_sort_ratio.

```python
from thefuzz import fuzz

print(fuzz.ratio("Apple", "Appel"))  # Levenshtein-based ratio: still sees the apple in Appel
print(fuzz.token_sort_ratio("introduction to algorithms", "intro to algo"))  # Still the same intro, all in a nutshell
```
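
TheFuzz also ships a process module for picking the best match out of a list of candidates. A quick sketch (the choices list here is made up for illustration):

```python
from thefuzz import process

choices = ["introduction to algorithms", "intro to databases", "advanced algorithms"]
best_match, score = process.extractOne("intro to algo", choices)
print(best_match, score)  # Picks the closest candidate along with its score
```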

Factors to consider in string similarity

When evaluating similarity, always weigh the context and the suitability of the method for your specific use case. Let's explore the key considerations:

Adjusting for reordered terms with token sort ratio

Handling variable word order calls for fuzz.token_sort_ratio:

```python
from thefuzz import fuzz

# Great for shuffled strings
print(fuzz.token_sort_ratio("algorithm intro", "intro to algo"))  # Different orders, same content!
```
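
When one string carries extra words on top of being reordered, TheFuzz's fuzz.token_set_ratio scores on the shared token set instead, so surplus tokens matter far less:

```python
from thefuzz import fuzz

# token_set_ratio focuses on the tokens both strings share,
# so the extra ", 3rd edition" barely dents the score
print(fuzz.token_set_ratio("introduction to algorithms",
                           "introduction to algorithms, 3rd edition"))
```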

Dealing with unequal lengths

Strings often vary in length. Padding the shorter string can help keep comparisons fair.

```python
def padded_similarity(str1, str2, pad_char=' '):
    # Pad the shorter string so both have equal length
    max_len = max(len(str1), len(str2))
    padded_str1 = str1.ljust(max_len, pad_char)
    padded_str2 = str2.ljust(max_len, pad_char)
    # Reuses similar() from the difflib example above
    return similar(padded_str1, padded_str2)

print(padded_similarity("algorithm", "algo "))  # Now length doesn't matter!
```

Adjusting comparison with normalization

Normalizing strings before comparison can make similarity scores more meaningful. It's quite handy when casing or punctuation varies between otherwise-identical strings.

```python
import re

def normalize(s):
    # Drop everything except letters and digits, then lowercase
    return re.sub('[^A-Za-z0-9]+', '', s).lower()

# Reuses similar() from the difflib example above
normalized_score = similar(normalize("Algo-rithm"), normalize("algOrithm"))
print(normalized_score)  # Case and punctuation no longer a bother
```
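
If accented characters are also in play, a minimal sketch using the standard-library unicodedata module (again reusing the similar() helper from the difflib example) might look like this:

```python
import unicodedata

def strip_accents(s):
    # Decompose accented characters, then drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if not unicodedata.combining(c))

print(similar(strip_accents("café"), strip_accents("cafe")))  # 1.0 after folding the accent
```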