As the title says, I recently got a requirement to do article similarity analysis. The requirement itself is simple: compare the similarity of two articles of roughly 300 characters each. The methods I have found so far all start with Chinese word segmentation (jieba) and then compare similarity. Time is tight and the task is heavy, so I'm hoping someone who has built a similar feature can give me some pointers.
PHP中文网 2017-04-18 10:33:37
You have already described the first step yourself: segment both articles into Chinese words, then compute the TF-IDF weight of each word in the two articles and take the cosine similarity of the resulting vectors. This can be implemented with gensim in Python.
If you have any questions, feel free to keep asking.
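A minimal sketch of that pipeline, assuming jieba and gensim are installed; the two strings are placeholders for the actual articles. One caveat: with only the two target articles as the corpus, the IDF part is degenerate (any word appearing in both articles gets zero weight under gensim's default TfidfModel), so this sketch computes cosine similarity on plain bag-of-words vectors; adding models.TfidfModel is only worthwhile if you fit it on a larger background corpus.

```python
import jieba
from gensim import corpora, similarities

doc_a = "第一篇文章的正文……"   # placeholder for the first ~300-character article
doc_b = "第二篇文章的正文……"   # placeholder for the second article

# 1. Segment each article into words with jieba
texts = [list(jieba.cut(doc_a)), list(jieba.cut(doc_b))]

# 2. Build a shared dictionary and bag-of-words vectors
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# 3. Cosine similarity: index doc_a's vector, query with doc_b's
index = similarities.MatrixSimilarity([corpus[0]], num_features=len(dictionary))
similarity = float(index[corpus[1]][0])
print("cosine similarity:", similarity)
```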
迷茫 2017-04-18 10:33:37
To add to the first answer:
When using TF-IDF and cosine similarity, you should remove stop words first.
"Stop word" is a translation of the English term stopword. In English these are very frequent words such as a, the, or, and so on, mostly articles, prepositions, adverbs, and conjunctions.
Such words contribute little to judging the meaning of a text, so they are filtered out before computing similarity.
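A small sketch of how the filtering could look before building the vectors, assuming a plain-text stop word list with one word per line; the stopwords.txt filename is hypothetical, any Chinese stop word list works.

```python
import jieba

# Load a stop word list (one word per line); "stopwords.txt" is a
# hypothetical filename -- use whichever Chinese stop word list you have.
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def tokenize(text):
    """Segment with jieba, then drop stop words and whitespace-only tokens."""
    return [w for w in jieba.cut(text) if w.strip() and w not in stopwords]

# Feed the filtered tokens into the dictionary/corpus step shown above
print(tokenize("这是一个关于文章相似度的简单例子"))
```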
That said, plain TF-IDF with cosine similarity is a bag-of-words measure: it ignores word order and synonyms, so it is not always reliable, especially on short texts.
(Shamelessly plugging my own link here, 2333)
I'd recommend combining TextRank with the algorithms above, for example by extracting keywords with TextRank first, as in the sketch below.
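One hedged way to do that: use jieba's built-in TextRank keyword extractor on each article and compare the keyword sets. The Jaccard overlap here is purely an illustration; you could also keep only these keywords when building the TF-IDF vectors.

```python
import jieba.analyse

doc_a = "第一篇文章的正文……"   # placeholder article text
doc_b = "第二篇文章的正文……"   # placeholder article text

# Extract the top keywords of each article with jieba's built-in TextRank
keys_a = set(jieba.analyse.textrank(doc_a, topK=20))
keys_b = set(jieba.analyse.textrank(doc_b, topK=20))

# Compare the keyword sets, e.g. with Jaccard overlap
union = keys_a | keys_b
jaccard = len(keys_a & keys_b) / len(union) if union else 0.0
print("keyword overlap:", jaccard)
```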