
How Can You Determine the Similarity Between Text Documents in Python?

Patricia Arquette
2024-10-23 06:52:02


Determining Text Similarity

A common task in natural language processing (NLP) is measuring how similar two text documents are. The most common approach is to convert the documents into TF-IDF vectors and compute the cosine similarity between those vectors.
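To make the idea concrete, here is a minimal from-scratch sketch using only the standard library. The helper names (tfidf_vectors, cosine_similarity), the whitespace tokenizer, the smoothed IDF formula, and the sample sentences are all assumptions chosen for illustration; real libraries tokenize and weight terms with more care.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw tf * smoothed idf)."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents share several terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents share no terms
```

Documents that share weighted terms score above zero, while documents with no terms in common score exactly zero, which is why the first pair ranks as more similar than the second.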

Implementing TF-IDF and Cosine Similarity

In Python, the Gensim and scikit-learn packages provide implementations of TF-IDF and cosine similarity. The following code, using scikit-learn, transforms documents into TF-IDF vectors and computes their pairwise similarity:

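<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

# Load documents; text_files is assumed to be a list of file paths
documents = []
for path in text_files:
    with open(path) as f:  # close each file after reading
        documents.append(f.read())

# Fit the vectorizer and build the TF-IDF matrix (rows are L2-normalized by default)
tfidf = TfidfVectorizer().fit_transform(documents)

# Rows have unit length, so the matrix product gives pairwise cosine similarity
pairwise_similarity = tfidf @ tfidf.T</code>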
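For readers without a set of text files at hand, the same computation can be tried on in-memory strings. The sketch below uses made-up sample sentences (an assumption) and also calls scikit-learn's cosine_similarity helper to check that the matrix product gives the same result when the default L2 normalization is in effect.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents standing in for the contents of text files
documents = [
    "The cat sat on the mat.",
    "A cat lay on the rug.",
    "Stock prices fell sharply today.",
]

# TfidfVectorizer L2-normalizes each row by default (norm="l2")
tfidf = TfidfVectorizer().fit_transform(documents)

# Matrix product of unit-length rows: entry (i, j) is the cosine similarity
product = (tfidf @ tfidf.T).toarray()

# The explicit helper should agree with the matrix product
explicit = cosine_similarity(tfidf)
```

If the vectorizer were created with norm=None, the matrix product would no longer be a cosine similarity, and the explicit helper would be the safer choice.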

Interpreting the Results

The pairwise_similarity result is a sparse matrix in which entry (i, j) is the cosine similarity between documents i and j. Each document's similarity to itself is 1, so the diagonal has to be masked out before searching for the closest match. The code below finds the document most similar to a given input document:

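<code class="python">import numpy as np

# Index of the input document; input_doc is assumed to be an element of documents
input_idx = documents.index(input_doc)

# toarray() returns a new dense copy, so keep a reference before masking it
similarity = pairwise_similarity.toarray()

# Mask out the diagonal (self-similarity) and find the most similar document
np.fill_diagonal(similarity, np.nan)
result_idx = np.nanargmax(similarity[input_idx])

# Get the most similar document
similar_doc = documents[result_idx]</code>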
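The lookup can also be wrapped in a small reusable function. This is only a sketch: the function name most_similar and the sample sentences are hypothetical, and it converts the similarity matrix to a dense array, which is fine for small collections but costly for large ones.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def most_similar(documents, input_idx):
    """Return the index of the document closest to documents[input_idx]."""
    tfidf = TfidfVectorizer().fit_transform(documents)
    # Dense copy of the pairwise cosine similarity matrix
    scores = (tfidf @ tfidf.T).toarray()
    # Exclude each document's similarity to itself
    np.fill_diagonal(scores, np.nan)
    return int(np.nanargmax(scores[input_idx]))


# Hypothetical sample documents
sample_docs = [
    "The cat sat on the mat.",
    "A cat lay on the rug.",
    "Stock prices fell sharply today.",
    "Share prices dropped sharply this morning.",
]
closest_to_first = most_similar(sample_docs, 0)
closest_to_third = most_similar(sample_docs, 2)
```

Here the two cat sentences pair with each other, as do the two sentences about prices, because those are the only pairs with overlapping vocabulary.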

Other Methods

Gensim offers additional options for text similarity tasks. Another resource to explore is [this Stack Overflow question](https://stackoverflow.com/questions/52757816/how-to-find-text-similarity-between-two-documents).
