How to Measure Text Similarity using TF-IDF and Cosine Similarity?

Mary-Kate Olsen
2024-10-23 06:53:30

Measuring Textual Similarity with TF-IDF and Cosine Similarity

Determining the similarity between two text documents is a crucial task in text mining and information retrieval. One popular approach involves utilizing TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity.

TF-IDF assigns a weight to each word in a document based on how often it occurs in that document and how rare it is across the corpus. Each document becomes a vector of these weights, and documents that share distinctive terms end up with vectors that point in similar directions.
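To make the weighting concrete, here is a minimal sketch of one common textbook formulation: term frequency (count divided by document length) multiplied by log(N / document frequency). The function name `tf_idf` and the toy documents are assumptions for illustration, and scikit-learn's TfidfVectorizer uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = count / document length, idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, already split into tokens
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(docs)
```

In this sketch "the" occurs in every document, so its weight is zero, while "mat" occurs in only one document and gets the largest weight in the first document.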

Cosine similarity is the cosine of the angle between two vectors. In general it ranges from -1 to 1, but TF-IDF weights are never negative, so for document vectors it falls between 0 (no terms in common) and 1 (vectors pointing in the same direction). In our case, the two vectors are the TF-IDF vectors of the two documents.
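As a rough illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name `cosine_similarity` and the sample vectors below are assumptions chosen for this sketch, not part of any library.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction, different length: similarity close to 1
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
# No shared non-zero components: similarity 0
orthogonal = cosine_similarity([1, 0, 0], [0, 3, 0])
```

Because the vectors are divided by their lengths, a long document and a short one with the same relative term weights still score as highly similar.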

Python Implementation

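In Python, scikit-learn makes computing pairwise similarities straightforward. TfidfVectorizer normalizes each document vector to unit length by default, so multiplying the TF-IDF matrix by its transpose yields the cosine similarities directly: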

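<code class="python">from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to plain-text files
documents = [Path(f).read_text() for f in text_files]
# One row per document, L2-normalized by default
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T</code>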

Alternatively, if the documents are already strings, use:

<code class="python">corpus = ["I'd like an apple", "An apple a day keeps the doctor away", "..."]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T</code>
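The product `tfidf * tfidf.T` is a cosine similarity only because the rows have unit length. The sketch below checks that on a small corpus by comparing the product with scikit-learn's own `cosine_similarity` function. The third document is an assumption added for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Small corpus; the last entry is hypothetical
corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "The doctor prescribed rest",
]
tfidf = TfidfVectorizer().fit_transform(corpus)

# Under the default norm="l2", every non-empty row has length 1
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()

# Matrix product of unit-length rows versus the library function
product = (tfidf @ tfidf.T).toarray()
reference = cosine_similarity(tfidf)
```

If the vectorizer is created with `norm=None`, the rows are no longer unit length and the plain product stops being a cosine similarity; in that case `cosine_similarity` is the safer choice.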

Interpreting Results

pairwise_similarity is a square sparse matrix whose entry (i, j) is the cosine similarity between documents i and j. To find the document most similar to a given one, convert the matrix to a dense array, replace the diagonal (each document's similarity to itself, which is always 1) with NaN, and take the index of the largest value in that document's row using np.nanargmax():

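<code class="python">import numpy as np

arr = pairwise_similarity.toarray()
# Mask each document's similarity to itself
np.fill_diagonal(arr, np.nan)
# input_doc must be one of the strings in corpus,
# otherwise corpus.index() raises ValueError
input_doc = "The scikit-learn docs are Orange and Blue"
input_idx = corpus.index(input_doc)
# Index of the highest similarity in the row, ignoring NaN
result_idx = np.nanargmax(arr[input_idx])
similar_doc = corpus[result_idx]</code>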
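For reference, here is a self-contained sketch of the whole lookup. The last three corpus entries are assumptions added so that the example runs end to end; they are not taken from the snippets above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    # The entries below are hypothetical, added for illustration
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
    "The scikit-learn docs are Orange and Blue",
]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T

arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)  # ignore self-similarity

input_doc = "The scikit-learn docs are Orange and Blue"
input_idx = corpus.index(input_doc)
result_idx = np.nanargmax(arr[input_idx])
similar_doc = corpus[result_idx]
# similar_doc is "I prefer scikit-learn to Orange"
```

The two documents share the terms "scikit", "learn" and "orange" once stop words are removed, which is why they score highest against each other.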

Other Considerations

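For large corpora and vocabularies, keep the data in sparse form where possible. Calling toarray() on the similarity matrix allocates n × n values for n documents, which quickly becomes more expensive than working with the sparse matrix directly.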
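One possible way to avoid the full dense matrix, sketched here under the assumption that only a single query document matters, is to multiply just that document's row by the transposed TF-IDF matrix. The corpus and the choice of `input_idx` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus for illustration
corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to Orange",
    "The scikit-learn docs are Orange and Blue",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

input_idx = 4  # position of the query document in corpus
# Similarities of one document to all others: a single row of length n
row = (tfidf[input_idx] @ tfidf.T).toarray().ravel()
row[input_idx] = -1.0  # exclude the document itself
result_idx = int(np.argmax(row))
```

This densifies only one row of n values instead of the whole n × n matrix, at the cost of repeating the multiplication for each new query.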

By adjusting the parameters in TfidfVectorizer, such as min_df for minimum document frequency, the TF-IDF computation can be customized to suit specific requirements.
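As a small illustration of how min_df changes the result, the sketch below fits two vectorizers on a hypothetical three-document corpus and compares their vocabularies. With min_df=2, terms that occur in fewer than two documents are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus for illustration
corpus = [
    "apple pie recipe",
    "apple tart recipe",
    "banana bread",
]

# Default min_df=1 keeps every term
default_vocab = sorted(TfidfVectorizer().fit(corpus).vocabulary_)
# min_df=2 keeps only terms found in at least two documents
frequent_vocab = sorted(TfidfVectorizer(min_df=2).fit(corpus).vocabulary_)
# frequent_vocab is ['apple', 'recipe']
```

Raising min_df shrinks the vocabulary and removes one-off terms such as typos, but it also discards rare words that may be the most distinctive ones, so the right value depends on the corpus.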

Additional Resources

  • [Introduction to Information Retrieval](http://infolab.stanford.edu/~backrub/classes/2002/cs276/handouts/04-tfidf.pdf)
  • [Computing Pairwise Similarities with Gensim](https://stackoverflow.com/questions/23752770/computing-pairwise-similarities-with-gensim)
