How can I calculate cosine similarity between two sentences without using external libraries?

Mary-Kate Olsen
2024-11-01 08:20:30

Calculating Cosine Similarity for Sentence Strings

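Cosine similarity is the cosine of the angle between two vectors. For vectors of word counts it ranges from 0 (no words in common) to 1 (identical word proportions). In text processing, representing each sentence as a vector of word frequencies lets you use it to estimate how similar two sentences are. To calculate cosine similarity for two strings without third-party libraries (the Python standard library is enough), follow these steps: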

  1. Tokenize the strings: Break each string into individual words, known as tokens.
  2. Create word vectors: For each string, create a dictionary (vector) where the keys are unique words, and the values are the frequencies of those words.
  3. Calculate dot product: Compute the dot product of the two vectors by summing the products of the frequencies of each word. Only words that appear in both strings contribute, because a missing word has frequency zero.
  4. Calculate magnitudes: Find the magnitude of each vector by squaring and summing all its frequencies, then taking the square root.
  5. Normalize: Divide the dot product by the product of the magnitudes to obtain the cosine similarity. If either magnitude is zero (for example, an empty string), treat the similarity as 0.
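
The five steps above can be traced by hand on a tiny input. The following is a minimal sketch, assuming two made-up sentences (`s1`, `s2`) with no punctuation and a hypothetical helper `count_words`; it uses plain dictionaries so each intermediate value is visible.

```python
import math

# Hypothetical example sentences, chosen so the arithmetic is easy to follow.
s1 = "the cat sat"
s2 = "the cat ran"

# Step 1: tokenize (naive whitespace split; assumes no punctuation).
tokens1 = s1.split()
tokens2 = s2.split()

# Step 2: build word-frequency vectors as plain dicts.
def count_words(tokens):
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts

vec1 = count_words(tokens1)  # {'the': 1, 'cat': 1, 'sat': 1}
vec2 = count_words(tokens2)  # {'the': 1, 'cat': 1, 'ran': 1}

# Step 3: dot product; only words present in both vectors contribute.
dot = sum(vec1[w] * vec2[w] for w in vec1 if w in vec2)  # 1*1 + 1*1 = 2

# Step 4: magnitudes (square root of the sum of squared frequencies).
mag1 = math.sqrt(sum(c * c for c in vec1.values()))  # sqrt(3)
mag2 = math.sqrt(sum(c * c for c in vec2.values()))  # sqrt(3)

# Step 5: normalize. Mathematically this is 2 / 3.
cosine = dot / (mag1 * mag2)
print(round(cosine, 4))  # 0.6667
```

Two of the three words are shared, so the score lands at two thirds, between 0 (no overlap) and 1 (identical word counts).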

A simple Python implementation:

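<code class="python">import math
import re
from collections import Counter

# Matches runs of word characters; punctuation is dropped. Case-sensitive.
WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    """Return the cosine similarity of two word-count mappings."""
    # Only words present in both vectors contribute to the dot product.
    intersection = set(vec1) & set(vec2)
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Sum of squared counts for each vector (the squared magnitudes).
    sum1 = sum(count ** 2 for count in vec1.values())
    sum2 = sum(count ** 2 for count in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero magnitude; return 0.0 instead of dividing by zero.
    if not denominator:
        return 0.0
    return numerator / denominator

def text_to_vector(text):
    """Tokenize text and count how often each word occurs."""
    words = WORD.findall(text)
    return Counter(words)</code>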

Example usage:

<code class="python">text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)

print("Cosine:", cosine)</code>

Output (shown to 12 decimal places; Python 3 prints more digits):

Cosine: 0.861640436855
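
One limitation of the tokenizer is that the `\w+` pattern is case-sensitive, so "Foo" and "foo" are counted as different words. The sketch below illustrates the effect and one possible remedy, assuming lowercasing is acceptable for your text. The names `text_to_vector_ci` and `cosine` are hypothetical and exist only to keep the block self-contained.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def text_to_vector_ci(text):
    # Hypothetical variant: lowercase the text before tokenizing.
    return Counter(WORD.findall(text.lower()))

def cosine(vec1, vec2):
    # Same calculation as in the article: dot product over shared words,
    # divided by the product of the two magnitudes.
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    mag1 = math.sqrt(sum(c * c for c in vec1.values()))
    mag2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = mag1 * mag2
    return numerator / denominator if denominator else 0.0

a = "Foo bar"
b = "foo BAR"

# Case-sensitive counting finds no shared words, so the score is 0.0.
case_sensitive = cosine(Counter(WORD.findall(a)), Counter(WORD.findall(b)))

# After lowercasing, both sentences have the same counts, so the score is
# approximately 1.0 (floating-point rounding can leave it a hair under).
case_insensitive = cosine(text_to_vector_ci(a), text_to_vector_ci(b))

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase depends on the data: it helps with sentence-initial capitals, but it also merges words where case carries meaning.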

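Note that this implementation uses raw word counts and does not apply TF-IDF weighting. TF-IDF down-weights words that are common across a collection of documents and often gives more meaningful similarity scores when comparing many texts.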
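
To make the TF-IDF remark concrete, here is a rough sketch using only the standard library. It is not part of the original implementation. The weighting formula (raw count times `ln(N / df) + 1`), the three-sentence corpus, and the names `tokenize`, `tfidf_vectors`, and `cosine` are all assumptions chosen for illustration; other TF-IDF variants exist.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def tokenize(text):
    # Assumption: lowercase, then keep runs of word characters.
    return WORD.findall(text.lower())

def tfidf_vectors(docs):
    """Return one {word: weight} dict per document.

    Assumed weighting: raw term count * (ln(N / df) + 1), where N is the
    number of documents and df is how many documents contain the word.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts a word at most once
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in df}
    return [{w: tf * idf[w] for w, tf in c.items()} for c in counts]

def cosine(vec1, vec2):
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    mag1 = math.sqrt(sum(v * v for v in vec1.values()))
    mag2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = mag1 * mag2
    return numerator / denominator if denominator else 0.0

# Made-up corpus for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
]

raw = [Counter(tokenize(d)) for d in corpus]
vectors = tfidf_vectors(corpus)

print(cosine(raw[0], raw[1]))          # raw counts, dominated by "the"
print(cosine(vectors[0], vectors[1]))  # TF-IDF weights, lower in this corpus
```

In this corpus the word "the" appears in every document, so it receives the smallest weight. The first two sentences therefore score lower under TF-IDF than under raw counts, because much of their raw overlap came from "the".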
