How to Calculate Cosine Similarity of Two Text Strings without External Libraries
In text analysis, cosine similarity measures how alike two texts are by treating each one as a vector of word counts and taking the cosine of the angle between the two vectors. External libraries can calculate this measure, but a simple pure-Python implementation needs only the standard library:
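<code class="python">import math
import re
from collections import Counter

# A "word" is any run of letters, digits, or underscores.
WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    # Dot product over the words that appear in both vectors.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two vector lengths (Euclidean norms).
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero length; return 0.0 instead of dividing by zero.
    if not denominator:
        return 0.0
    return numerator / denominator


def text_to_vector(text):
    # Map each word to the number of times it occurs in the text.
    words = WORD.findall(text)
    return Counter(words)</code>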
The get_cosine function takes two word-count vectors, vec1 and vec2, and returns their cosine similarity; text_to_vector builds such a vector from a string. Here's how to use them to compare two text strings text1 and text2:
<code class="python">text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)

print("Cosine:", cosine)</code>
Output (shown to 12 decimal places; Python 3 prints a few more digits):
Cosine: 0.861640436855
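For word-count vectors the score ranges from 0 (no words in common) to 1 (the same words in the same proportions), so a value of about 0.86 indicates that the two text strings are highly similar.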
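To see where that number comes from, the score can be rebuilt by hand. The two texts share six words, and "sentence" occurs twice in the second text, so the dot product is 7; the squared vector lengths are 6 and 11, which gives 7 / √66 ≈ 0.8616. The sketch below restates the two functions in condensed form so that it runs on its own, then checks this arithmetic and a few boundary cases. The variable names (by_hand, no_overlap, empty, case_differs) are illustrative and not part of the original code.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    # Same tokenization as above: runs of word characters, case preserved.
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    # Dot product over shared words divided by the product of vector lengths.
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    sum1 = sum(c * c for c in vec1.values())
    sum2 = sum(c * c for c in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    return numerator / denominator if denominator else 0.0


v1 = text_to_vector("This is a foo bar sentence .")
v2 = text_to_vector("This sentence is similar to a foo bar sentence .")

# Rebuild the score step by step. A Counter returns 0 for missing words,
# so iterating over v1 alone is enough to compute the dot product.
dot = sum(v1[w] * v2[w] for w in v1)            # 7
norm1_sq = sum(c * c for c in v1.values())      # 6
norm2_sq = sum(c * c for c in v2.values())      # 11
by_hand = dot / math.sqrt(norm1_sq * norm2_sq)  # 7 / sqrt(66)

score = get_cosine(v1, v2)

# Boundary cases.
no_overlap = get_cosine(text_to_vector("foo bar"), text_to_vector("baz qux"))
empty = get_cosine(text_to_vector(""), text_to_vector("foo bar"))
# Matching is case-sensitive, so "Foo" and "foo" count as different words.
case_differs = get_cosine(text_to_vector("Foo Bar"), text_to_vector("foo bar"))

print(round(score, 4), no_overlap, empty, case_differs)
```

The last case shows a limitation worth knowing: because the words are compared exactly, texts that differ only in capitalization score 0.0. Lowercasing both strings before calling text_to_vector is one simple way to avoid that, if case should not matter for your comparison.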