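火车头 has a main-text extractor, and quite a few other scraping tools have one too, but I have never understood how these things are actually implemented.
Could someone knowledgeable explain the principle behind it?
For example, what are the steps?
Or how would you implement it in Python? A simple example would be great.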
PHP中文网2017-04-18 09:05:01
If you are targeting specific sites, you can make a simple judgment based on tags such as p and article. If you need something more general, analyze the fetched page data itself: for example, write an algorithm that calculates the density of Chinese (non-tag) text in each part of the page and treats the densest part as the main text. I haven't implemented this myself, but that is the basic idea.
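The density idea can be sketched with nothing but the standard library. This is only one possible reading of it, under stated assumptions: the names `BlockCollector` and `extract_main_text`, the set of container tags, and the link penalty weight are all hypothetical choices for illustration, not what 火车头 or any other tool actually does. Text is credited to its innermost container, and containers made up mostly of link text (navigation, footers) are penalized.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Group visible text by its innermost container element."""

    # Assumption: these tags are treated as candidate content blocks.
    CONTAINERS = {"div", "article", "section", "main", "td"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.stack = []      # containers that are currently open
        self.blocks = []     # containers that have been closed
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.link_depth = 0  # > 0 while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in self.CONTAINERS:
            self.stack.append({"text": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in self.CONTAINERS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth or not self.stack:
            return
        block = self.stack[-1]
        block["text"].append(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def extract_main_text(html):
    """Return the text of the block with the highest density score."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    best_text, best_score = "", 0
    for block in parser.blocks:
        total_chars = sum(len(t) for t in block["text"])
        # Assumption: link text counts double against a block.
        score = total_chars - 2 * block["link_chars"]
        if score > best_score:
            best_text, best_score = "\n".join(block["text"]), score
    return best_text


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="content">
  <p>This is the first paragraph of the article body text.</p>
  <p>This is the second paragraph, also plain non-tag text.</p>
</div>
<div id="footer"><a href="/contact">Contact</a> Copyright</div>
<script>var ignored = "script content is never counted as text";</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

On real pages the body is often split across nested containers, so a practical version would probably also merge scores from child blocks into their parents; this sketch leaves that out to keep the idea visible.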
PHP中文网2017-04-18 09:05:01
HTTP request simulation (usually with the requests or urllib2 module)
Information extraction (since HTML is structured markup, XPath or BeautifulSoup is generally used)
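The two steps above might look roughly like the following in Python 3. To keep the sketch free of third-party dependencies it swaps in `urllib.request` (the Python 3 successor to urllib2) for the HTTP step and the standard `html.parser` module in place of XPath or BeautifulSoup for the extraction step. The names `fetch_html`, `ParagraphExtractor`, and `extract_paragraphs`, and the choice to collect only `<p>` elements, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Step 1: request the page like a browser would and decode the body."""
    # Some sites reject the default Python user agent, so send a generic one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Assumption: fall back to UTF-8 when the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ParagraphExtractor(HTMLParser):
    """Step 2: collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside inline tags such as <b> still belongs to the open <p>.
        if self.in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]
```

A call such as `extract_paragraphs(fetch_html(url))` chains the two steps for a URL of your choice. With requests and BeautifulSoup installed, the same pipeline is usually shorter, but the structure (fetch, then parse and select) stays the same.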