How to compute the similarity between two text documents?

python
text-processing
natural-language-processing
similarity-matrix
by Alex Kataev · Feb 15, 2025
TLDR

To compute the similarity between two text documents, turn them into TF-IDF vectors and take their cosine similarity. Python's scikit-learn (sklearn) gives you a neat solution:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Vectorization and cosine similarity - whole lotta math in one line!
    documents = ['Text of document 1', 'Text of document 2']
    vectorizer = TfidfVectorizer()  # Like a superhero for your text documents!
    matrix = vectorizer.fit_transform(documents)
    similarity = cosine_similarity(matrix)  # "Who's your pair, Doc?" -says Cosine
    print(f"First Date Result: {similarity[0][1]}")

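Output: A date score between 0 (went really bad) and 1 (love at first sight). TF-IDF weights are never negative, so the score can't drop below 0. similarity is a square matrix whose entry [i][j] scores document i against document j, so similarity[0][1] is the match between the first two documents.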
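
Curious what that score actually measures? Here is a minimal sketch of the cosine formula itself, in plain Python with no libraries. It assumes raw word counts rather than TF-IDF weights, and the helper names bag_of_words and cosine are made up for illustration; scikit-learn's version is the one to reach for in practice.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated words (illustrative tokenizer)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity: dot product divided by the product of vector lengths."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document has nothing in common with anything.
    return dot / (norm_a * norm_b)


doc_a = bag_of_words("the cat sat on the mat")
doc_b = bag_of_words("the cat lay on the rug")
doc_c = bag_of_words("stock markets fell sharply")

print(cosine(doc_a, doc_b))  # shared words "the", "cat", "on" push the score up
print(cosine(doc_a, doc_c))  # no shared words, so the score is 0.0
```

The score depends only on the angle between the two vectors, not on their lengths, so a long document and a short one about the same topic can still score high.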

Dealing with "Big Data" in Relationships

Finding matches in big document collections? Sparse matrices are your wingman here, saving you time and memory. Scikit-learn returns the TF-IDF matrix in sparse form by default, and cosine_similarity accepts it directly; pass dense_output=False to keep the result sparse too. And since TfidfVectorizer L2-normalizes each row by default, the plain product matrix @ matrix.T gives the same pairwise similarity scores. No need to make things complicated with dense matrices!
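
One way to keep things sparse end to end is sketched below. It assumes you stick with TfidfVectorizer's default L2 normalization, and the three sample documents are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]

# fit_transform returns a SciPy sparse matrix; rows are L2-normalized by default.
tfidf = TfidfVectorizer().fit_transform(docs)

# Option 1: let cosine_similarity keep its result sparse.
sparse_scores = cosine_similarity(tfidf, dense_output=False)

# Option 2: with unit-length rows, the dot product already is the cosine.
product_scores = tfidf @ tfidf.T

# Both routes agree; densify only for this small comparison.
print(np.allclose(sparse_scores.toarray(), product_scores.toarray()))
```

On a real collection you would skip the toarray() calls and work with the sparse result directly, since a dense n-by-n matrix is exactly the memory hog you were trying to avoid.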

The Art of Courting Text

Before scoring any similarity points, it's all about good grooming. Lowercasing, stemming, and dropping stop words and punctuation are the textual equivalent of a shower and a shave:

    import nltk
    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    # One-time setup: word_tokenize needs NLTK's Punkt data, e.g. nltk.download('punkt')
    # (newer NLTK releases ask for 'punkt_tab' instead).

    # Ever had a shower, shave and haircut in one? Here's the code for it!
    stemmer = PorterStemmer()  # Barber shop, anyone?

    def preprocess(text):
        # Ditch the capitals
        text = text.lower()
        # Tokenize, drop stop words and pure-punctuation tokens, then stem
        tokens = nltk.word_tokenize(text)
        stemmed_tokens = [
            stemmer.stem(token)
            for token in tokens
            if token not in ENGLISH_STOP_WORDS and any(ch.isalnum() for ch in token)
        ]
        return ' '.join(stemmed_tokens)

    # All cleaned up and fancy
    processed_docs = [preprocess(doc) for doc in documents]

Want Meaningful Relationships? Go Semantic!

There's more to similarity than meets the eye. Context can make or break the match. Here's where spaCy and Google's Universal Sentence Encoder earn their keep. spaCy's .similarity() method gives a quick verdict: by default it takes the cosine similarity of the two documents' averaged word vectors, so you need a model that ships with vectors:

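    import spacy

    # Load a larger model with vectors
    # (install it first: python -m spacy download en_core_web_lg)
    nlp = spacy.load('en_core_web_lg')

    # Feed spaCy the raw text: stemmed tokens often have no entry in the vector table
    doc1 = nlp(documents[0])
    doc2 = nlp(documents[1])
    print(f"Semantic Similarity Score: {doc1.similarity(doc2)}")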

Partial to the deep learning crowd? The Universal Sentence Encoder produces fixed-length vectors that are more than just pretty faces:

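    import tensorflow_hub as hub
    from sklearn.metrics.pairwise import cosine_similarity

    # Load the encoder. Version 4 is a TF2 SavedModel that hub.load returns as a callable;
    # the older /2 module uses the TF1 Hub format and cannot be called this way.
    encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    # Compute the embeddings from the raw text; the encoder handles its own tokenization.
    embeddings = encoder(documents).numpy()
    similarity_deep = cosine_similarity(embeddings)
    print(f"Deep Thinker's Similarity Score: {similarity_deep[0][1]}")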

Playing the Field with Gensim

With pairwise comparisons in your toolbox, what if you decide to play the field? Need to find the best catch out of many? Gensim is your main tool: it builds a similarity index over the whole collection, which you can then query with any new document:

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel
    from gensim.similarities import MatrixSimilarity

    # Collection of documents is your playing field
    # (Gensim wants each document as a list of tokens, not a single string)
    tokenized_docs = [doc.split() for doc in processed_docs]
    dct = Dictionary(tokenized_docs)
    corpus = [dct.doc2bow(tokens) for tokens in tokenized_docs]

    # TF-IDF model for scoring the potentials
    tfidf = TfidfModel(corpus)
    index = MatrixSimilarity(tfidf[corpus], num_features=len(dct))

    # Your pick of the day
    query_doc = preprocess('your pick for today')
    query_bow = dct.doc2bow(query_doc.split())
    similarity_array = index[tfidf[query_bow]]

    # Who caught your eye the most? Pairs of (document index, score), best match first
    most_similar_docs = sorted(enumerate(similarity_array), key=lambda item: -item[1])
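
If you'd rather stay in scikit-learn, the same "best catch" ranking can be sketched roughly like this. The corpus and query are made-up examples, and this is an illustration of the idea rather than a drop-in replacement for the Gensim index:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "how to bake sourdough bread at home",
    "training a neural network with python",
    "python tips for text processing",
]
query = "text processing in python"

# Learn the vocabulary and IDF weights from the corpus only.
vectorizer = TfidfVectorizer()
corpus_matrix = vectorizer.fit_transform(corpus)

# Project the query into the same vector space; unseen words are ignored.
query_vector = vectorizer.transform([query])

# One row of scores: the query against every document in the corpus.
scores = cosine_similarity(query_vector, corpus_matrix).ravel()

# Document indices sorted from best match to worst.
ranking = np.argsort(-scores)
for idx in ranking:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

The key design choice is calling transform (not fit_transform) on the query, so it is scored with the corpus vocabulary and weights instead of redefining them.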

Seeing is Believing

Visual evidence of matches helps when dealing with similarity matrices. Convert your results into a heatmap for a clearer perspective:

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.heatmap(similarity, annot=True)  # Nothing like a good see-it-to-believe-it map!
    plt.show()