I'm writing some crawlers in Python 2.7 and want to scrape a few websites, so I need to determine whether a page can be crawled. My first instinct was to check the HTTP status code, but after running the crawler I found that many target sites return a 404 error page when you request a nonexistent URL, yet the status code is still 200, so I ended up fetching lots of pages that don't actually exist. This is really a misconfiguration on the sites' side, but it means I can no longer rely on the status code. Is there any other way to correctly determine whether a page is actually a 404 and should not be crawled?
阿神2017-04-18 10:26:44
First of all, a 200 status code only tells you that the HTTP request itself succeeded, so checking for 200 alone will not work correctly on every website.
Secondly, when writing a crawler you should study how each target site actually behaves. Make a few manual requests first and look for patterns, for example, whether the content returned for missing pages has any distinguishing features.
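A minimal sketch of this idea: after inspecting a site's "soft 404" page by hand, collect the marker phrases it contains and check every fetched body against them. The markers below are hypothetical examples, not real values from any particular site.

```python
# -*- coding: utf-8 -*-
# Marker phrases observed (hypothetically) on the target site's error page.
# Inspect the site's real 404 page to find its own markers.
SOFT_404_MARKERS = [u"页面不存在", u"Page Not Found", u"404"]


def looks_like_soft_404(html):
    """Return True if the page body contains any known error-page marker."""
    return any(marker in html for marker in SOFT_404_MARKERS)
```

This runs unchanged on both Python 2.7 and Python 3; the `u""` prefixes keep the Chinese marker usable under 2.7.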
黄舟2017-04-18 10:26:44
Judge by the content of the page itself: if the page body is empty, skip it and return early.
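One way to sketch the "empty body" check is a simple length threshold. The 200-character cutoff here is an arbitrary assumption; tune it per site, since a real content page is normally far longer than a bare error page.

```python
def has_content(html, min_length=200):
    """Treat very short bodies as empty/error pages.

    min_length is an arbitrary example threshold -- tune it per site.
    """
    return len(html.strip()) >= min_length
```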
怪我咯2017-04-18 10:26:44
Even if the status code is 200, the returned 404 page usually has different HTML elements from a normal, crawlable page. So you can also decide whether a page is a 404 by checking for the presence (or absence) of specific HTML elements.
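A sketch of this element-based check, under two assumptions that you would replace after inspecting the real site: a genuine article page contains a `<div class="article">` element, and the soft-404 page instead has "404" in its `<title>`. In a production crawler a real HTML parser (e.g. BeautifulSoup) would be more robust than regex, but plain stdlib `re` is enough to show the idea.

```python
import re


def is_error_page(html):
    """Heuristic soft-404 detector based on page structure.

    Both selectors below are assumptions for illustration -- adapt them
    after inspecting the target site's actual markup.
    """
    # A normal article page (hypothetically) contains this element.
    if re.search(r'<div[^>]*class="[^"]*\barticle\b[^"]*"', html):
        return False
    # The site's error page (hypothetically) puts "404" in the title.
    title = re.search(r"<title>(.*?)</title>", html, re.S)
    return bool(title and "404" in title.group(1))
```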