A beginner's Python web crawler implementation

WBOY, Original: 2016-06-10 15:06:10

First, let's look at the Python libraries for fetching web pages: urllib and urllib2.

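So what is the difference between urllib and urllib2?
urllib2 can be regarded as an extension of urllib. Its most obvious advantage is that urllib2.urlopen() accepts a Request object as its argument, which lets you control the headers of the HTTP request.
For HTTP requests you should prefer urllib2 where possible. However, urllib.urlretrieve() and the quote/unquote family of functions such as urllib.quote were never added to urllib2, so urllib is still needed at times.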
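
As a rough illustration of the two points above, the following sketch builds a request with a custom header and round-trips a string through quote and unquote. It assumes Python 3, where urllib2's Request lives in urllib.request and the quoting helpers live in urllib.parse. The helper name build_request, the URL, and the User-Agent string are placeholders invented for this example, and nothing is sent over the network.

```python
from urllib.parse import quote, unquote
from urllib.request import Request


def build_request(url, user_agent):
    """Return a Request object that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL and header value, chosen only for illustration.
request = build_request("http://www.example.com/", "Mozilla/5.0 (example)")
# Request stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))

# quote() percent-encodes characters that are not safe in a URL path.
keyword = "web crawler & spider"
encoded = quote(keyword)
print(encoded)
print(unquote(encoded) == keyword)
```

The Request object would then be passed to urlopen() in place of a plain URL string.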

The argument passed to urllib.urlopen() should be a URL with a supported scheme such as http, ftp, or file. For example:

urllib.urlopen('http://www.baidu.com')
urllib.urlopen('file:///D:/Python/Hello.py')
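
To show the file scheme without depending on a real path, here is a minimal sketch that writes a temporary file and reads it back through a file:// URL. It assumes Python 3 (urllib.request.urlopen instead of urllib.urlopen); the helper name read_via_file_url and the file contents are made up for the example.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_via_file_url(path):
    """Open a local file through a file:// URL and return (url, bytes)."""
    file_url = Path(path).resolve().as_uri()
    with urlopen(file_url) as response:
        return file_url, response.read()


# Create a throwaway file so the example does not rely on any existing path.
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as handle:
    handle.write(b"print('Hello')\n")
    temp_path = handle.name

file_url, content = read_via_file_url(temp_path)
os.remove(temp_path)
print(file_url.startswith("file:///"))
print(content)
```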

Here is an example that downloads all the GIF images referenced on a web page. The Python code is as follows:

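import re
import urllib

def getHtml(url):
    # Fetch the page and return its raw HTML (Python 2 urllib).
    page = urllib.urlopen(url)
    html = page.read()
    page.close()
    return html

def getImg(html):
    # Capture every src value that ends in .gif;
    # [^"]+ keeps each match inside a single attribute.
    reg = r'src="([^"]+\.gif)"'
    imgre = re.compile(reg)
    imgList = imgre.findall(html)
    print imgList
    cnt = 1
    for imgurl in imgList:
        # Save as 1.gif, 2.gif, ...; urlretrieve needs an absolute URL.
        urllib.urlretrieve(imgurl, '%s.gif' % cnt)
        cnt += 1

if __name__ == '__main__':
    html = getHtml('http://www.baidu.com')
    getImg(html)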

With this approach we can fetch web pages and then extract the data we need from them.
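
The extraction step can be tried without any network access. The sketch below is one possible variant, assuming Python 3: it runs a regular expression over a hard-coded HTML string and uses urllib.parse.urljoin to turn relative src values into absolute URLs, which a download call would need. The function name extract_gif_urls, the sample HTML, and the example.com URLs are all invented for illustration.

```python
import re
from urllib.parse import urljoin

# [^"]+ keeps each match inside a single src attribute.
GIF_SRC = re.compile(r'src="([^"]+\.gif)"')


def extract_gif_urls(html, base_url):
    """Return absolute URLs for every .gif referenced by a src attribute."""
    return [urljoin(base_url, src) for src in GIF_SRC.findall(html)]


sample_html = (
    '<img src="logo.png">'
    '<img src="/img/banner.gif">'
    '<img src="http://static.example.com/icons/spinner.gif">'
)
urls = extract_gif_urls(sample_html, "http://www.example.com/news/index.html")
for index, url in enumerate(urls, start=1):
    # Show which local file name each URL would be saved under.
    print("%d.gif <- %s" % (index, url))
```

A regular expression is enough for a quick demo like this; an HTML parser would be more robust against unusual markup.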

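In practice, a crawler built on the urllib module is very inefficient, because each call blocks until the response has arrived. Next, let's introduce Tornado Web Server.
Tornado Web Server is a very lightweight, highly scalable web server written in Python that uses non-blocking I/O; the well-known FriendFeed site was built with it. Unlike most other mainstream web server frameworks (mainly Python frameworks), Tornado uses epoll-based non-blocking I/O, so it responds quickly, can handle thousands of concurrent connections, and is especially well suited to real-time web services.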
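
To make the benefit of non-blocking I/O concrete, here is a small simulation. It does not use Tornado: it relies on the standard library's asyncio (Python 3), and fake_fetch only sleeps instead of performing a real request, so the names, URLs, and delays are all assumptions made for the demo. The point is only that several waits can overlap when nothing blocks.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for a network request: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return url, 200


async def fetch_all(urls):
    # gather() schedules every coroutine before waiting, so the delays overlap.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed_run(urls):
    """Run all simulated fetches and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    return results, time.perf_counter() - start


urls = ["http://www.example.com/page/%d" % n for n in range(5)]
results, elapsed = timed_run(urls)
print(len(results), "pages in %.2f s" % elapsed)
```

Five blocking requests of 0.1 s each would take about 0.5 s one after another; here the total stays close to a single delay.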

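Fetching web pages with Tornado can be considerably more efficient, particularly when many pages are fetched concurrently through its non-blocking AsyncHTTPClient. The example further down uses the simpler blocking HTTPClient.
According to the Tornado website, backports.ssl_match_hostname must also be installed. The website is: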

http://www.tornadoweb.org/en/stable/

import tornado.httpclient

def Fetch(url):
    http_header = {'User-Agent': 'Chrome'}
    # Both timeouts are given in seconds.
    http_request = tornado.httpclient.HTTPRequest(
        url=url, method='GET', headers=http_header,
        connect_timeout=200, request_timeout=600)
    # HTTPClient is Tornado's blocking (synchronous) client.
    http_client = tornado.httpclient.HTTPClient()

    print 'Start downloading data...'
    http_response = http_client.fetch(http_request)
    print 'Finish downloading data...'

    # HTTP status code, for example 200
    print http_response.code

    # get_all() yields one (name, value) pair per header line.
    all_fields = http_response.headers.get_all()
    for field in all_fields:
        print field

    print http_response.body
    http_client.close()

if __name__ == '__main__':
    Fetch('http://www.baidu.com')

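Common methods of the response object returned by urllib2.urlopen():

(1) info(): returns the response headers

(2) getcode(): returns the HTTP status code

(3) geturl(): returns the URL that was actually retrieved, which can differ from the requested URL after a redirect

(4) read(): reads the content of the response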
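
The four methods can be exercised offline with a throwaway local server. The following sketch assumes Python 3, where the same method names are available on the object returned by urllib.request's openers; HelloHandler, inspect_response, and demo are hypothetical names, and the server only answers with a fixed page on a random local port.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny local server that answers every GET with a fixed HTML page."""

    def do_GET(self):
        body = b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def inspect_response(url):
    """Call info(), getcode(), geturl() and read() on one response."""
    opener = build_opener(ProxyHandler({}))  # bypass any configured proxy
    response = opener.open(url)
    try:
        return {
            "content_type": response.info().get("Content-Type"),
            "code": response.getcode(),
            "url": response.geturl(),
            "body": response.read(),
        }
    finally:
        response.close()


def demo():
    """Start the local server, query it once, and shut it down again."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        return url, inspect_response(url)
    finally:
        server.shutdown()
        server.server_close()


url, result = demo()
print(result["code"], result["content_type"], result["body"])
```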
