How to Calculate Cosine Similarity of Two Text Strings in Pure Python?

Susan Sarandon
2024-10-30 08:05:02

How to Calculate Cosine Similarity of Two Text Strings without External Libraries

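In text analysis, cosine similarity measures how alike two texts are by treating each text as a vector of word counts and taking the cosine of the angle between the two vectors. For word-count vectors the result lies between 0 (no shared words) and 1 (identical word proportions). External libraries can calculate this measure, but a simple pure-Python implementation needs only the standard library: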

<code class="python">import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    # Dot product: only words present in both vectors contribute
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the Euclidean norms of the two vectors
    sum1 = sum(count ** 2 for count in vec1.values())
    sum2 = sum(count ** 2 for count in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero norm; return 0.0 instead of dividing by zero
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)</code>
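For reference, the quantity computed by `get_cosine` can be written out as a formula. This is a sketch in notation that is not part of the original code: $a_w$ and $b_w$ are assumed here to stand for the number of times word $w$ occurs in the first and second text, respectively.

```latex
\cos(\theta)
  = \frac{\sum_{w} a_w \, b_w}
         {\sqrt{\sum_{w} a_w^{2}} \; \sqrt{\sum_{w} b_w^{2}}}
```

In the code, `numerator` is the sum in the top of the fraction (words missing from either text contribute zero, so only the intersection is visited), while `sum1` and `sum2` are the two sums under the square roots.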

`text_to_vector` turns a string into a `Counter` of word counts, and `get_cosine` takes two such vectors, vec1 and vec2, and returns their cosine similarity. Here's how to use them to compare two text strings text1 and text2:

<code class="python">text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)

print("Cosine:", cosine)</code>

Output:

Cosine: 0.861640436855

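The score is close to 1, the maximum for word-count vectors, which indicates that the two text strings are highly similar. The trailing periods do not affect the result, because the `\w+` pattern matches only word characters.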
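One thing the approach above does not do is normalize case, so "Foo" and "foo" count as different words. The following is a minimal sketch of one way to handle that, not part of the original recipe: the wrapper name `cosine_from_text` and its `lowercase` parameter are hypothetical, introduced here only for illustration.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def cosine_from_text(text_a, text_b, lowercase=True):
    """Hypothetical wrapper: tokenize two strings and return their cosine similarity."""
    if lowercase:
        # Assumption: case differences should not count as different words
        text_a, text_b = text_a.lower(), text_b.lower()
    vec_a = Counter(WORD.findall(text_a))
    vec_b = Counter(WORD.findall(text_b))

    # Dot product over the words the two texts share
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))

    # A text with no words has zero norm; treat the similarity as 0.0
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)

# Same words, different case: similar only when lowercasing is enabled
same_case = cosine_from_text("Foo bar", "foo BAR")
case_sensitive = cosine_from_text("Foo bar", "foo BAR", lowercase=False)

# The article's example: 7 shared-word products over sqrt(6) * sqrt(11)
example = cosine_from_text(
    "This is a foo bar sentence .",
    "This sentence is similar to a foo bar sentence .",
)
print(same_case, case_sensitive, example)
```

Whether lowercasing is appropriate depends on the data; for case-sensitive identifiers or acronyms it may be better to leave `lowercase=False`.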
