How to compute the similarity between two text documents?

python

text-processing

natural-language-processing

similarity-matrix

by Alex Kataev · Feb 15, 2025

To compute the similarity between text documents, we take the cosine similarity of their TF-IDF vectors, which Python's scikit-learn (sklearn) turns into a neat solution:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Vectorization and cosine similarity - whole lotta math in one line!
documents = ['Text of document 1', 'Text of document 2']
vectorizer = TfidfVectorizer()  # Like a superhero for your text documents!
matrix = vectorizer.fit_transform(documents)
similarity = cosine_similarity(matrix)  # "Who's your pair, Doc?" -says Cosine

print(f"First Date Result: {similarity[0][1]}")

Output: A date score between 0 (went really badly) and 1 (love at first sight). TF-IDF weights are never negative, so the cosine cannot drop below 0. The result is a full matrix, so similarity[i][j] holds the score for documents i and j.
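To see roughly what those two calls compute, here is a from-scratch sketch using only the standard library. It is a simplification: whitespace tokenization and a smoothed IDF modeled on scikit-learn's default, so the numbers may differ from the library's. The helper names tfidf_vectors and cosine are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF, so terms present in every document keep a non-zero weight
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["the cat sat on the mat", "the cat lay on the rug", "stock prices fell sharply"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shared words give a score above 0
print(cosine(vecs[0], vecs[2]))  # no shared words gives 0.0
```

Storing each vector as a dict of non-zero weights is the same idea as a sparse matrix row: terms a document never uses cost nothing.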

Dealing with "Big Data" in Relationships

Finding matches in big document collections? Sparse matrices are your wingman here, saving you time and memory. TfidfVectorizer returns a SciPy sparse matrix by default, and calling cosine_similarity(matrix, dense_output=False) keeps the result sparse too. Because the TF-IDF rows are L2-normalized by default, the plain product matrix @ matrix.T gives the same scores. No need to make things complicated with dense matrices!
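A minimal sketch of keeping the whole computation sparse, assuming a made-up three-document corpus. It relies on two things: cosine_similarity accepts dense_output=False, and TfidfVectorizer L2-normalizes each row by default, so a plain matrix product of the TF-IDF matrix with its transpose yields cosine scores as well.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, invented for this example
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]

# SciPy sparse matrix; each row is L2-normalized by default
matrix = TfidfVectorizer().fit_transform(docs)

# Ask for a sparse result instead of a dense NumPy array
sparse_sim = cosine_similarity(matrix, dense_output=False)

# Rows already have unit length, so a dot product is a cosine similarity
product_sim = matrix @ matrix.T

print(issparse(sparse_sim), issparse(product_sim))
print(sparse_sim[0, 1], product_sim[0, 1])
```

For very large collections, even the sparse N-by-N result can grow big, so it may be worth computing it a block of rows at a time.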

The Art of Courting Text

Before scoring any similarity points, it's all about good grooming. Lowercasing, stemming, and removing stop words and punctuation are the textual equivalent of a shower and a shave:

import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# One-time download of the tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt', quiet=True)

# Ever had a shower, shave and haircut in one? Here's the code for it!
stemmer = PorterStemmer()  # Barber shop, anyone?
def preprocess(text):
    # Ditch the capitals
    text = text.lower()
    # Tokenize, skip pure-punctuation tokens and stop words, then stem
    tokens = nltk.word_tokenize(text)
    stemmed_tokens = [
        stemmer.stem(token)
        for token in tokens
        if any(ch.isalnum() for ch in token) and token not in ENGLISH_STOP_WORDS
    ]
    return ' '.join(stemmed_tokens)

# All cleaned up and fancy
processed_docs = [preprocess(doc) for doc in documents]

Want Meaningful Relationships? Go Semantic!

There's more to similarity than meets the eye. Context can make or break the match. Here's where spaCy and Google's Universal Sentence Encoder earn their keep. spaCy's .similarity() method compares averaged word vectors for a quick verdict:

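import spacy

# Load a larger model with vectors (install once: python -m spacy download en_core_web_lg)
nlp = spacy.load('en_core_web_lg')
# Feed the raw text: stemmed tokens often have no entry in the vector table
doc1 = nlp(documents[0])
doc2 = nlp(documents[1])

print(f"Sentimental Similarity Score: {doc1.similarity(doc2)}")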
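By default, that score is the cosine similarity of the two documents' averaged word vectors. A toy sketch of the idea, using made-up 3-dimensional vectors in place of a real model (the toy_vectors table and the doc_vector helper are hypothetical, not part of spaCy):

```python
import math

# Made-up 3-dimensional "word vectors", for illustration only
toy_vectors = {
    "cat":    [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "sleeps": [0.1, 0.9, 0.2],
    "naps":   [0.2, 0.8, 0.3],
    "stocks": [0.0, 0.1, 0.9],
    "fell":   [0.1, 0.2, 0.8],
}

def doc_vector(text):
    """Average the vectors of the words we know; unknown words are skipped."""
    known = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

pets_a = doc_vector("cat sleeps")
pets_b = doc_vector("kitten naps")
finance = doc_vector("stocks fell")

print(cosine(pets_a, pets_b))   # no shared words, yet a high score
print(cosine(pets_a, finance))  # noticeably lower
```

This is why semantic approaches can pair up documents that TF-IDF would score at 0: the match happens in vector space, not on shared tokens.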

Partial to the deep learning crowd? The Universal Sentence Encoder produces fixed-length vectors that are more than just pretty faces:

import tensorflow_hub as hub

# Load the encoder (version 4 is the TF2 SavedModel that hub.load returns as a callable)
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Compute the embeddings from the raw text; the encoder handles its own preprocessing
embeddings = encoder(documents)
similarity_deep = cosine_similarity(embeddings.numpy())

print(f"Deep Thinker's Similarity Score: {similarity_deep[0][1]}")

Playing the Field with Gensim

With pairwise comparisons in your toolbox, what if you decide to play the field? Need to find the best catch out of many? Wish to cluster potentials? Gensim is your main tool:

from gensim.similarities import MatrixSimilarity
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

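# Collection of documents is your playing field
# Gensim expects each document as a list of tokens, not a single string
tokenized_docs = [doc.split() for doc in processed_docs]
dct = Dictionary(tokenized_docs)
corpus = [dct.doc2bow(tokens) for tokens in tokenized_docs]

# TF-IDF model for scoring the potentials
tfidf = TfidfModel(corpus)
index = MatrixSimilarity(tfidf[corpus], num_features=len(dct))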

# Your pick of the day
query_doc = preprocess('your pick for today')
query_bow = dct.doc2bow(query_doc.split())
similarity_array = index[tfidf[query_bow]]

# Who caught your eye the most?
most_similar_docs = sorted(enumerate(similarity_array), key=lambda item: -item[1])
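If only the few best matches matter, sorting the whole list is more work than needed. A small sketch with the standard library's heapq.nlargest, using a made-up list of scores as a stand-in for similarity_array (the top_matches helper is hypothetical):

```python
import heapq

# Stand-in for the scores returned by the index; the values are made up
similarity_array = [0.12, 0.87, 0.05, 0.64, 0.33]

def top_matches(scores, k=3):
    """Return the k best (document_index, score) pairs, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda item: item[1])

for doc_id, score in top_matches(similarity_array, k=3):
    print(f"document {doc_id}: {score:.2f}")
```

The document indices refer to positions in the indexed corpus, so they can be used to look the original texts back up.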

Seeing is Believing

Visual evidence of matches helps when dealing with similarity matrices. Convert your results into a heatmap for a clearer perspective:

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(similarity, annot=True)  # Nothing like a good see-it-to-believe-it map!
plt.show()