Measuring Document Similarity
A common way to measure the similarity between two text documents in NLP is to transform the documents into TF-IDF vectors and then compute the cosine similarity between those vectors, a metric widely used in information retrieval systems. For background, see "Introduction to Information Retrieval," which is available online as an e-book.
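To make the two steps concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. The function names, the whitespace tokenizer, the smoothed idf formula, and the sample sentences are all assumptions chosen for illustration. Library implementations differ in tokenization and weighting details.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using a simple tf * idf scheme."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumed variant): rarer terms get larger weights.
        vectors.append(
            {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, count in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["the cat sat on the mat", "the cat lay on the rug", "stock prices fell sharply"]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # shares several terms, so greater than zero
print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms, so 0.0
```

Documents that share no terms score 0, and a document compared with itself scores 1, which is why the diagonal of a pairwise similarity matrix carries no useful information.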
Implementation in Python
Python libraries such as Gensim and scikit-learn support both TF-IDF and cosine similarity. In scikit-learn, TfidfVectorizer L2-normalizes each document vector by default, so multiplying the TF-IDF matrix by its transpose yields the pairwise cosine similarities:
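<code class="python">from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to plain-text files
documents = [Path(f).read_text() for f in text_files]

# Each row of tfidf is one document's TF-IDF vector
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T</code>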
Strings already held in memory can be passed to the vectorizer directly:
<code class="python">corpus = ["I'd like an apple", "An apple a day keeps the doctor away"]
tfidf = TfidfVectorizer(min_df=1, stop_words="english").fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T</code>
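As a sanity check, the product of the TF-IDF matrix with its transpose can be compared against scikit-learn's own cosine_similarity function. This sketch assumes the vectorizer's default norm="l2" setting is left unchanged. The variable names product and explicit are introduced here for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple", "An apple a day keeps the doctor away"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# Rows are L2-normalized by default, so the dot product of two rows
# is already their cosine similarity.
product = (tfidf @ tfidf.T).toarray()

# The same quantity computed explicitly, which normalizes the rows itself.
explicit = cosine_similarity(tfidf)

print(np.allclose(product, explicit))
print(np.allclose(product.diagonal(), 1.0))
```

If the vectorizer is created with norm=None, the shortcut no longer applies and the rows must be normalized first, for example by calling cosine_similarity as above.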
Interpreting the Results
The resulting pairwise_similarity is a square sparse matrix in which entry (i, j) is the similarity between documents i and j. To find the document most similar to a given one, mask the diagonal elements (each document's similarity to itself) with NaN and apply NumPy's nanargmax function to that document's row:
<code class="python">import numpy as np

arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)  # mask self-similarity

input_doc = "I'd like an apple"  # must be one of the strings in corpus
input_idx = corpus.index(input_doc)
result_idx = np.nanargmax(arr[input_idx])  # index of the highest remaining score
most_similar_doc = corpus[result_idx]</code>
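When more than one match is wanted, the same idea extends to ranking a whole row of the matrix. The sketch below is one possible approach. The helper top_k_similar and the 4x4 matrix are hypothetical and made up for illustration. In practice the matrix would be pairwise_similarity.toarray().

```python
import numpy as np


def top_k_similar(similarity, doc_idx, k=3):
    """Return the indices of the k documents most similar to doc_idx, best first."""
    scores = np.asarray(similarity, dtype=float)[doc_idx].copy()
    scores[doc_idx] = -np.inf  # exclude the document itself
    k = min(k, len(scores) - 1)  # never return more than the other documents
    order = np.argsort(scores)[::-1]  # highest score first
    return order[:k].tolist()


# Hypothetical dense similarity matrix for four documents.
similarity = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])

print(top_k_similar(similarity, 0, k=2))  # [2, 1]
```

Masking with negative infinity instead of NaN keeps argsort well behaved, since the masked entry simply sorts last.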