
django - Analyzing the similarity of two articles with Python

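As the title says, I recently need to do article similarity analysis. The requirement is simple: compare two articles of roughly 300 characters each and measure how similar they are. From what I have found so far, the approach is to segment the Chinese text into words first (jieba) and then compare similarity. Time is short and the workload is heavy, so I would appreciate pointers from anyone who has built something similar.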

PHP中文网 · 2741 days ago · 647

All replies (2)

  • PHP中文网

    PHP中文网 2017-04-18 10:33:37

    You have already identified the first step. First segment both articles into Chinese words, then compute the TF-IDF weight of each word in the two articles, and finally compute the cosine similarity between the two resulting vectors. In Python this can be implemented with gensim.
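The three steps can be sketched as follows. This is a minimal illustration using only the standard library rather than gensim, and it assumes the articles have already been segmented into token lists (for example with `jieba.lcut`). The helper names `tfidf_vectors` and `cosine` and the sample tokens are made up for this example. Note that with only two documents an unsmoothed IDF of log(N/df) is zero for every word the articles share, so the sketch uses a smoothed IDF instead; that choice is an assumption, not something the answer specifies.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of token lists into one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (assumption) so shared terms keep a non-zero weight.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical, already-segmented articles.
doc_a = ["我", "喜欢", "看", "电影"]
doc_b = ["我", "喜欢", "看", "书"]

vec_a, vec_b = tfidf_vectors([doc_a, doc_b])
similarity = cosine(vec_a, vec_b)
print(f"similarity: {similarity:.3f}")
```

The result lies between 0 (no shared words) and 1 (identical word distribution), so a threshold on it can serve as a rough "similar or not" decision.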

    If you have any questions, please continue to ask.

  • 迷茫

    迷茫 2017-04-18 10:33:37

    To add to the first answer:
    Before applying cosine similarity or TF-IDF, remove stop words first.

    The term is a translation of the English "stop word". English text contains many very frequent words such as a, the, and or, typically articles, prepositions, adverbs, or conjunctions.
    Words like these have little effect on how we judge the meaning of a text, so they can be dropped before comparing.
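As a rough illustration of that filtering step, the sketch below drops stop words from a token list. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and deliberately tiny; a real project would load a much fuller stop word list from a file.

```python
# A tiny illustrative stop word set (assumption); real lists are far longer.
STOP_WORDS = {"的", "了", "是", "和", "在", "a", "the", "or"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only non-empty tokens that are not in the stop word set."""
    return [t for t in tokens if t.strip() and t.lower() not in stop_words]


# Hypothetical output of a word segmenter.
tokens = ["这", "是", "一篇", "关于", "相似度", "的", "文章"]
filtered = remove_stop_words(tokens)
print(filtered)
```

The filtered token list is then what gets passed on to the TF-IDF and cosine similarity steps.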

    However, plain cosine similarity and TF-IDF are not very reliable in some situations.
    (Plugging my own link here, haha.)

    I recommend combining TextRank with the approach above.
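The answer does not say how the two should be combined. One possible reading is to use TextRank to score the words of each article and then compare only the top-ranked keywords. The sketch below shows just the scoring part: it builds an undirected co-occurrence graph over a sliding window and runs a PageRank-style iteration. The function name `textrank_scores`, its parameters, and the sample tokens are assumptions made for this example.

```python
from collections import defaultdict


def textrank_scores(tokens, window=3, damping=0.85, iterations=50):
    """Score words by running a PageRank-style iteration on a co-occurrence graph."""
    # Link each word to the next (window - 1) words that follow it.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's current score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[v] / len(neighbors[v]) for v in neighbors[word])
            for word in neighbors
        }
    return scores


# Hypothetical segmented text with stop words already removed.
tokens = ["文章", "相似度", "分析", "文章", "分词", "相似度", "计算"]
scores = textrank_scores(tokens)
top_keywords = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_keywords)
```

Words that co-occur with many other words end up with higher scores, so the top of the ranking approximates the article's keywords.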
