How to Measure the Similarity Between Text Documents?

DDD
2024-10-23 06:55:02

Determining the Similarity Between Text Documents

Measuring Document Similarity

To measure the similarity between two text documents, the standard NLP approach is to convert each document into a TF-IDF vector and then compute the cosine similarity between those vectors, a metric widely used in information retrieval systems. For more depth, see "Introduction to Information Retrieval" (Manning, Raghavan, and Schütze), which is available online as an e-book.
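
As a minimal sketch of the metric itself, the snippet below computes cosine similarity by hand with NumPy. The helper name `cosine_similarity` and the three toy weight vectors are hypothetical and chosen only for illustration; real TF-IDF vectors would come from a vectorizer.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical term weights for three documents over a three-term vocabulary
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, twice the length
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

same_direction = cosine_similarity(doc_a, doc_b)  # close to 1
no_overlap = cosine_similarity(doc_a, doc_c)      # 0: no shared terms
```

Because the metric depends only on the direction of the vectors, a long document and a short one with the same term proportions score as highly similar.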

Implementation in Python

Python libraries such as Gensim and scikit-learn provide both TF-IDF and cosine similarity. In scikit-learn, the pairwise cosine similarity of a set of documents can be computed from their TF-IDF vectors:

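<code class="python">from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# text_files: an iterable of paths to the documents to compare
documents = [Path(f).read_text() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# Rows are L2-normalized by default, so this product is the cosine similarity
pairwise_similarity = tfidf @ tfidf.T</code>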

Documents already held in memory as strings can be passed in directly:

<code class="python">corpus = ["I'd like an apple", "An apple a day keeps the doctor away"]
tfidf = TfidfVectorizer(min_df=1, stop_words="english").fit_transform(corpus)
pairwise_similarity = tfidf @ tfidf.T</code>
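
The matrix product yields cosine similarities because TfidfVectorizer scales every row to unit length by default (norm="l2"). The sketch below checks that on the same two-sentence corpus by comparing the product against scikit-learn's `cosine_similarity` function. The variable names are illustrative, and the check assumes the vectorizer's default normalization is left unchanged.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple", "An apple a day keeps the doctor away"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# Similarity via the matrix product of the row-normalized TF-IDF matrix
via_product = (tfidf @ tfidf.T).toarray()

# Similarity via scikit-learn's helper, which normalizes rows itself
via_sklearn = cosine_similarity(tfidf)

# Each row should have an L2 norm of 1 under the default settings
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

print(np.allclose(via_product, via_sklearn))
print(np.allclose(row_norms, 1.0))
```

If the vectorizer is created with norm=None, the rows are no longer unit length and the plain product stops being a cosine similarity; in that case calling `cosine_similarity` directly is the safer choice.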

Interpreting the Results

The resulting pairwise_similarity is a square sparse matrix whose entry (i, j) is the similarity between documents i and j. To find the document most similar to a given one, convert the matrix to a dense array, replace the diagonal (each document's similarity to itself) with NaN, and apply NumPy's nanargmax function to that document's row.

<code class="python">import numpy as np

arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)  # mask self-similarity
input_doc = "I'd like an apple"  # must match an entry of corpus exactly
input_idx = corpus.index(input_doc)
result_idx = np.nanargmax(arr[input_idx])  # index of the highest remaining score
most_similar_doc = corpus[result_idx]</code>
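
The same idea extends to ranking several candidates rather than picking a single best match. The following is a rough sketch under stated assumptions: the helper `rank_similar`, its parameters, and the three-sentence corpus are hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def rank_similar(corpus, query_idx, top_k=2):
    """Return (index, score) pairs for the documents most similar to corpus[query_idx]."""
    tfidf = TfidfVectorizer().fit_transform(corpus)
    scores = (tfidf @ tfidf.T).toarray()[query_idx]
    scores[query_idx] = -np.inf  # exclude the query document itself
    order = np.argsort(scores)[::-1][:top_k]  # highest scores first
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical corpus: the first two sentences share terms, the third does not
corpus = [
    "An apple a day keeps the doctor away",
    "I'd like an apple",
    "Trains leave at noon",
]

ranked = rank_similar(corpus, query_idx=0)
for idx, score in ranked:
    print(f"{score:.3f}  {corpus[idx]}")
```

Sorting the whole row is fine for small corpora; for large ones, a partial sort such as np.argpartition avoids ordering scores that will never be shown.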
