How to Calculate Similarity Between Text Documents Using TF-IDF and Cosine Similarity?

Mary-Kate Olsen
2024-10-23 06:47:02

How to Calculate Text Document Similarity

Computing Pairwise Similarities

A common way to measure the similarity between two text documents is to convert them into TF-IDF (term frequency-inverse document frequency) vectors and then compare those vectors with cosine similarity. The approach is standard in information retrieval textbooks; "Introduction to Information Retrieval" by Manning, Raghavan, and Schütze covers it in detail.
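
To make the two steps concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the plain textbook weighting tf * log(N / df); scikit-learn uses a smoothed variant, so the numbers will not match its output. The helper names `tfidf_vectors` and `cosine` are illustrative and do not come from any library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


docs = ["apple banana apple", "banana cherry", "durian"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # share "banana": small positive value
print(cosine(vecs[0], vecs[2]))  # no shared terms: 0.0
```

Documents with no terms in common score 0, and a document compared with itself scores 1, which is why the diagonal of a pairwise similarity matrix carries no useful information.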

Python libraries such as Gensim and scikit-learn implement both the TF-IDF transformation and cosine similarity. With scikit-learn, TfidfVectorizer L2-normalizes each row by default, so multiplying the TF-IDF matrix by its transpose yields the pairwise cosine similarities directly:

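<code class="python">from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to plain-text files
documents = [Path(f).read_text() for f in text_files]

# Fit the vectorizer and build the TF-IDF matrix (one row per document)
tfidf = TfidfVectorizer().fit_transform(documents)

# Rows are L2-normalized by default, so the matrix product with the
# transpose gives the pairwise cosine similarities
pairwise_similarity = tfidf @ tfidf.T</code>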

Alternatively, when the documents are already in memory as strings:

<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

# Create a TF-IDF vectorizer that drops English stop words
# (min_df=1, the default, keeps every term found in at least one document)
vect = TfidfVectorizer(min_df=1, stop_words="english")

# Learn the vocabulary and build the TF-IDF matrix (one row per document)
tfidf = vect.fit_transform(corpus)

# Calculate pairwise cosine similarity (rows are L2-normalized by default)
pairwise_similarity = tfidf @ tfidf.T</code>
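
The matrix product above equals cosine similarity only because each TF-IDF row has unit length. As a sanity check, the sketch below compares the product against scikit-learn's `cosine_similarity`, which normalizes the rows itself. The `norm="l2"` argument is the default and is spelled out only to make the dependency visible.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

# norm="l2" is the default; it is written out here for emphasis
tfidf = TfidfVectorizer(stop_words="english", norm="l2").fit_transform(corpus)

# Matrix product of unit-length rows
product = (tfidf @ tfidf.T).toarray()

# Reference result computed by scikit-learn
reference = cosine_similarity(tfidf)

matches = np.allclose(product, reference)
unit_diagonal = np.allclose(np.diag(product), 1.0)
print(matches, unit_diagonal)
```

If the vectorizer were created with `norm=None`, the product would be a matrix of raw dot products instead, and `cosine_similarity` would be the safer call.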

Interpreting the Results

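pairwise_similarity is a sparse matrix with one row and one column per document in the corpus. After converting it to a NumPy array with toarray(), the entry at row i, column j is the cosine similarity between documents i and j. The diagonal holds each document's similarity to itself, which is 1 for any document with at least one term in the vocabulary.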

For instance, to find the document most similar to "The scikit-learn docs are Orange and Blue", look up its index in the corpus, mask the diagonal (self-similarity) with np.fill_diagonal(), and apply np.nanargmax to the corresponding row:

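<code class="python">import numpy as np

# Dense copy of the similarity matrix, with self-similarity masked out
arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)

input_doc = "The scikit-learn docs are Orange and Blue"
input_idx = corpus.index(input_doc)

# Index of the highest remaining similarity in that document's row
result_idx = np.nanargmax(arr[input_idx])
print(corpus[result_idx])  # I prefer scikit-learn to Orange</code>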
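
The same masked row can be used to rank several neighbours instead of only the best one. The sketch below assumes a dense similarity array as built above; `top_k_similar` is a hypothetical helper name, not part of NumPy or scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
arr = (tfidf @ tfidf.T).toarray()


def top_k_similar(arr, idx, k=2):
    """Return (index, score) pairs for the k documents most similar to idx."""
    scores = arr[idx].copy()
    scores[idx] = -np.inf  # exclude the document itself
    order = np.argsort(scores)[::-1][:k]  # highest scores first
    return [(int(i), float(scores[i])) for i in order]


ranking = top_k_similar(arr, idx=4, k=2)
for i, score in ranking:
    print(f"{score:.3f}  {corpus[i]}")
```

Sorting the whole row is fine for a small corpus; for a very long row, `np.argpartition` would avoid a full sort.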

Note that for large datasets, keeping the matrix sparse conserves memory. In that case, skip the conversion to a dense array: overwrite the diagonal of the sparse matrix to mask self-similarity, then call argmax() on the row directly:

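<code class="python">n, _ = pairwise_similarity.shape
# Overwrite the diagonal so a document is never its own best match
pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
# Column index of the largest entry in the input document's row
pairwise_similarity[input_idx].argmax()</code>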
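
When only one query document matters, the full n-by-n matrix may not be needed at all. The following sketch, which assumes the same corpus and vectorizer settings as above, multiplies a single TF-IDF row by the transposed matrix to obtain just that document's similarities.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

input_idx = corpus.index("The scikit-learn docs are Orange and Blue")

# 1 x n row of similarities between the query document and every document
row = (tfidf[input_idx] @ tfidf.T).toarray().ravel()

# Mask self-similarity, then pick the best remaining match
row[input_idx] = -1.0
result_idx = int(row.argmax())
print(corpus[result_idx])
```

This keeps memory proportional to the number of documents, not its square, at the cost of repeating the multiplication for each new query.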
