
Cosine Similarity between 2 Number Lists

python
performance
numpy
pandas
by Anton Shumikhin · Feb 3, 2025
TLDR

Compute cosine similarity between two number lists using numpy in three steps:

  • Convert lists to numpy arrays.
  • Compute the dot product of the arrays.
  • Normalize by the L2 norms of these arrays.

Here's an illustrative code snippet to pin down the concept.

import numpy as np

# A couple of numerical arrays
nums1 = np.array([1, 2, 3])
nums2 = np.array([4, 5, 6])

# Cosine similarity calculation
cos_sim = np.dot(nums1, nums2) / (np.linalg.norm(nums1) * np.linalg.norm(nums2))
print(cos_sim)  # Caution: mind-blowing similarity metric coming through!

The cos_sim variable now holds a value between -1 and 1, where 1 means identical directions and -1 means exact opposites. This is how we roll in the cosine similarity universe!

Cosine similarity decoded

The cosine similarity is much like a secret handshake in the world of algorithmic efficiency, giving us an instant snapshot of the closeness between two numeric sequences or word occurrence vectors.

To visualize the idea, we're considering these number lists as vectors in an n-dimensional space (a.k.a. space-time continuum for number lists). The cosine of the angle between these vectors then gives us a metric of similarity that looks purely at direction, ignoring magnitude.
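A quick sketch drives home that only direction matters: scaling a vector doesn't change its cosine similarity with anything. (The helper name cosine_similarity here is our own, not a library function.)

```python
import numpy as np

def cosine_similarity(v1, v2):
    # Dot product normalized by the L2 norms
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a   # same direction, ten times the magnitude
c = -a       # exact opposite direction

sim_same = cosine_similarity(a, b)      # scaling leaves the angle untouched
sim_opposite = cosine_similarity(a, c)  # a 180-degree angle gives -1
```

Here sim_same comes out at 1.0 and sim_opposite at -1.0, matching the range described above.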

Performance considerations

Now, we've got the recipe down for small lists; but what about when data gets hefty? Big data needs careful treatment, and this is where numpy's efficient array operations come in handy, blending speed with practicality.

For pandas users — no worries! Just convert those pandas Series to numpy arrays and use the magic formula.
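The conversion is a one-liner with Series.to_numpy(); a minimal sketch (the Series contents are illustrative):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])

# Convert Series to numpy arrays, then apply the same formula
a1, a2 = s1.to_numpy(), s2.to_numpy()
cos_sim = np.dot(a1, a2) / (np.linalg.norm(a1) * np.linalg.norm(a2))
```

For these values, cos_sim lands around 0.9746, the same result the numpy-only snippet above would give.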

Extend functionality with alternatives

  • SciPy's spatial.distance: Here's some spice from the scipy contingent. Apply the spatial.distance.cosine method, but remember the finale — subtract the result from 1 to enjoy the taste of similarity.

    from scipy.spatial import distance

    # "Fly, you digital fools!" into the realm of cosine distance
    cos_dist = distance.cosine(nums1, nums2)
    cos_sim = 1 - cos_dist  # Transform the distance into similarity
    print(cos_sim)  # Voila! Your cosine similarity served hot
  • Sklearn's pairwise metrics: And now, a score from the sklearn ensemble! The sklearn.metrics.pairwise.cosine_similarity function is a powerful tool for multiple feature sets. Extract the single values from the output matrix and let the comparing games begin!

    from sklearn.metrics.pairwise import cosine_similarity

    # sklearn expects two-dimensional arrays, hence the extra brackets
    cos_sim = cosine_similarity([nums1], [nums2])[0][0]  # Extract the single value from the result matrix
    print(cos_sim)  # Voila! Your cosine similarity, sklearn-style!

Edge case mastery

  • Equal-length requirement: Call len() to maintain equilibrium in the universe. Check that both lists are the same length before computing and be a cosine similarity Zen master.
  • Data types: Keep calm and ensure that your lists contain numerical data types. Numpy is picky about its diet!
  • Zero vectors: Remember, in the rare event of coming across a zero vector, we do not divide by zero (divide by chocolate is much tastier!). Set similarity as zero to avoid a singularity in your code.
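The edge cases above can be rolled into one defensive helper; this is a minimal sketch, and safe_cosine_similarity is our own hypothetical name:

```python
import numpy as np

def safe_cosine_similarity(lst1, lst2):
    # Equal-length check: mismatched vectors have no meaningful dot product
    if len(lst1) != len(lst2):
        raise ValueError("Vectors must have the same length")
    # Coerce to float arrays so numpy is happy with the data types
    v1 = np.asarray(lst1, dtype=float)
    v2 = np.asarray(lst2, dtype=float)
    n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
    # Zero vectors: return 0 instead of dividing by zero
    if n1 == 0 or n2 == 0:
        return 0.0
    return float(np.dot(v1, v2) / (n1 * n2))
```

Whether a zero vector should yield 0, NaN, or an exception is a design choice; 0 is a common, pragmatic default for similarity pipelines.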

Jousting with big data

Working with large datasets or high-dimensional vectors? Fear not, just utilize numpy's vectorization charm. Numpy's C-optimized operations can handle large arrays much more efficiently (aka stop, drop and let numpy roll).
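For instance, comparing one query vector against many candidates at once avoids a Python-level loop entirely. A sketch (the array sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
query = np.array([1.0, 2.0, 3.0])
matrix = rng.random((10_000, 3))  # e.g. 10k candidate vectors as rows

# One matrix-vector product plus row-wise norms: no explicit loop
sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
best = np.argmax(sims)  # index of the most similar candidate row
```

One pass through C-optimized code computes all 10,000 similarities, which is the whole point of letting numpy roll.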

In Python-only environments, boost performance with generator expressions and the built-in sum() function. They'll burn through dot product calculations faster than a naive loop!

def python_cosine_similarity(lst1, lst2):
    dot_product = sum(a * b for a, b in zip(lst1, lst2))  # Quick! fetch the dot product!
    norm_lst1 = sum(a**2 for a in lst1) ** 0.5  # I summon the norm of lst1!
    norm_lst2 = sum(b**2 for b in lst2) ** 0.5  # I summon the norm of lst2!
    return dot_product / (norm_lst1 * norm_lst2)

Remember, the best chefs are also efficient! Keep memory usage minimal by avoiding unnecessary duplication of large data arrays.
