How Can You Determine the Similarity Between Text Documents in Python?
Determining Text Similarity
In natural language processing (NLP), measuring the similarity between two text documents is a common task. A widely used approach is to convert the documents into TF-IDF vectors and compute the cosine similarity between those vectors.
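To make the idea concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. The helper names `tfidf_vectors` and `cosine_similarity`, the whitespace tokenizer, and the sample sentences are assumptions made for illustration. The weighting is the plain `tf * log(N / df)` form, which differs from the smoothed variant scikit-learn uses by default.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        vectors.append(weights)
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a vector with no weighted terms is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents share several terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents share no terms
```

Here the first two documents share terms and so score above zero, while the first and third share none and score exactly zero.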
Implementing TF-IDF and Cosine Similarity
In Python, the Gensim and scikit-learn packages provide implementations of TF-IDF and cosine similarity. The following code, using scikit-learn, transforms documents into TF-IDF vectors and computes their pairwise similarity:
<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

# Load documents
documents = [open(f).read() for f in text_files]

# Create TF-IDF vectorizer
tfidf = TfidfVectorizer().fit_transform(documents)

# Compute pairwise similarity
pairwise_similarity = tfidf * tfidf.T</code>
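The matrix product above equals cosine similarity because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot product of two rows is already their cosine. The following sketch checks this against scikit-learn's `cosine_similarity` function. The in-memory sample documents are an assumption, and the `@` operator is used for the matrix product, which for SciPy sparse matrices gives the same result as `*`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical in-memory documents standing in for files read from disk
documents = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

# One L2-normalized TF-IDF row per document (norm='l2' is the default)
tfidf = TfidfVectorizer().fit_transform(documents)

# Matrix product of the TF-IDF matrix with its transpose, as a dense array
product = (tfidf @ tfidf.T).toarray()

# Reference result from scikit-learn's dedicated helper
reference = cosine_similarity(tfidf)

# True when both routes agree up to floating-point error
same_result = bool(np.allclose(product, reference))
```

If the vectorizer is created with `norm=None`, the rows are no longer unit length and the product is no longer a cosine similarity, so `cosine_similarity` is the safer choice in that case.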
Interpreting the Results
The result, pairwise_similarity, is a sparse matrix whose entry in row i and column j is the similarity score between documents i and j. Each document's similarity to itself is 1, so the diagonal has to be masked out before searching for the best match. The code below finds the document most similar to a given input document:
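<code class="python">import numpy as np

# Index of the input document (documents is the list the matrix was built from)
input_idx = documents.index(input_doc)

# Convert to a dense array first, so the diagonal mask is applied in place
# rather than to a temporary copy that is immediately discarded
arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)

# Find the most similar document, ignoring the masked self-similarity
result_idx = np.nanargmax(arr[input_idx])

# Get the most similar document
similar_doc = documents[result_idx]</code>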
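The two steps can be combined into one self-contained sketch. The function name `most_similar` and the sample documents are assumptions for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def most_similar(documents, input_doc):
    """Return (document, score) for the entry of documents closest to input_doc.

    input_doc must itself be an element of documents.
    """
    tfidf = TfidfVectorizer().fit_transform(documents)
    # Dense pairwise cosine similarity matrix
    similarity = (tfidf @ tfidf.T).toarray()
    # Mask self-similarity so a document is never matched with itself
    np.fill_diagonal(similarity, np.nan)
    input_idx = documents.index(input_doc)
    result_idx = int(np.nanargmax(similarity[input_idx]))
    return documents[result_idx], float(similarity[input_idx, result_idx])


# Hypothetical sample documents: two about a cat, two about markets
documents = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
    "bond prices rose sharply today",
]

match, score = most_similar(documents, documents[0])
```

With these sample documents, the first sentence shares terms only with the second, so `match` is the second sentence. Converting the whole matrix to a dense array is fine for small collections. For a large corpus it may be better to keep the matrix sparse and work on one row at a time. Note also that `np.nanargmax` raises a `ValueError` when a row contains only NaN, which happens if the list holds a single document.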
Other Methods
Gensim offers additional options for text similarity tasks. Another resource to explore is [this Stack Overflow question](https://stackoverflow.com/questions/52757816/how-to-find-text-similarity-between-two-documents).