
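Web crawler - What is the principle behind scraping Baidu News with Python?

火车头 (LocoySpider) has a main-text extractor, and many other scraping tools have one too, but I have never understood how these extractors are actually implemented.

Could someone knowledgeable explain the underlying principle?

For example, what are the steps?

Or how could it be implemented in Python? A simple example would help.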

天蓬老师 | 2763 days ago | 977

reply all(3)

  • 高洛峰

    高洛峰 2017-04-18 09:05:01


    Source address: http://www.cnblogs.com/jasondan/p/3497757.html

  • PHP中文网

    PHP中文网 2017-04-18 09:05:01

    For a specific site, you can rely on tags such as p and article to pick out the main text with simple rules. If you need something more general, analyze the fetched page itself: for example, write an algorithm that computes the density of Chinese text (text outside tags) in each part of the page and treats the high-density parts as the main text. I haven't implemented this myself, but that is the basic idea.
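The density idea above can be sketched roughly as follows. This is only an illustration, not how 火车头 or any particular tool does it: the line-by-line split, the `threshold` of 0.5, and the function names `block_density` and `extract_main_text` are all assumptions made for the example. Each line of HTML is scored by the share of its characters that are Chinese characters outside tags, and only high-scoring lines are kept.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                                   # any HTML tag
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.S | re.I)  # script/style blocks
CJK_RE = re.compile(r"[\u4e00-\u9fff]")                           # common Chinese characters


def block_density(block: str) -> float:
    """Share of the block's characters that are Chinese text outside tags."""
    if not block.strip():
        return 0.0
    text = TAG_RE.sub("", block)
    return len(CJK_RE.findall(text)) / len(block)


def extract_main_text(html: str, threshold: float = 0.5) -> str:
    """Keep the lines whose Chinese-text density reaches the threshold.

    The threshold is an assumed value; real pages need tuning.
    """
    html = SCRIPT_RE.sub("", html)  # scripts and styles are never main text
    kept = []
    for line in html.splitlines():
        if block_density(line) >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return "\n".join(kept)


sample = """<html><head><title>新闻</title><script>var a = "你好";</script></head>
<body>
<div class="nav"><a href="/">首页</a> <a href="/news">新闻</a></div>
<p>今天上午，本市举行了一场关于网络爬虫技术的公开讲座，吸引了大量听众。</p>
<p>主讲人介绍了正文提取的基本思路，并演示了一个简单的例子。</p>
<div class="footer"><a href="/about">关于我们</a></div>
</body></html>"""

main_text = extract_main_text(sample)
print(main_text)
```

Navigation and footer lines are mostly markup, so their density is low and they are dropped, while the two paragraphs are mostly text and survive. Pages that put the whole document on one line would need a different way of splitting into blocks, for example by tag.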

  • PHP中文网

    PHP中文网 2017-04-18 09:05:01

    1. Simulate the HTTP request (usually with the requests or urllib2 module; urllib2 is Python 2 only, and its Python 3 counterpart is urllib.request)

    2. Extract the information (because HTML is structured markup, XPath or BeautifulSoup is generally used)
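The two steps above might look like this in Python 3. To keep the sketch free of third-party dependencies it swaps in the standard library's `urllib.request` and `html.parser` for requests and BeautifulSoup; the approach is the same. The names `fetch`, `ParagraphParser`, and `extract_paragraphs` and the `User-Agent` value are assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import List
from urllib.request import Request, urlopen


def fetch(url: str, timeout: float = 10.0) -> str:
    """Step 1: send a GET request with a browser-like User-Agent header."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class ParagraphParser(HTMLParser):
    """Step 2: collect the text inside every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def _flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.paragraphs.append(text)
        self._buf = []


def extract_paragraphs(html: str) -> List[str]:
    """Return the text of each paragraph in document order."""
    parser = ParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


page = '<div><a href="/">Home</a><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></div>'
paragraphs = extract_paragraphs(page)
print(paragraphs)
```

For a live page, the two steps chain together as `extract_paragraphs(fetch(url))`. With BeautifulSoup installed, the extraction step would shrink to a few lines, at the cost of an extra dependency.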
