I'm new to Python and have written some simple crawler code: crawlers using regular expressions and multithreading, scraping images from Tieba, and scraping IPs from proxy sites; I've also touched on Scrapy. To go deeper, what should I work on next, which books should I read, and which open-source projects should I study? Hoping the Zhihu experts can point me in the right direction...
Thanks!!!
Learning web crawling with Python comes down to three big blocks: fetching, parsing, and storing.
There is also Scrapy, a commonly used crawler framework, which is covered in detail at the end.
First, here is a list of related articles I have put together; they cover the basic concepts and techniques needed to get started with web crawling: 宁哥的小站 - 网络爬虫.
When we type a URL into the browser and hit Enter, what happens behind the scenes? For example, if you enter 宁哥的小站 (fireling的数据天地), a site focused on web crawling, data mining and machine learning, you will see its homepage.
Simply put, four steps take place:
1. The browser looks up the domain name through DNS and gets the server's IP address;
2. It establishes a connection to the server and sends an HTTP request;
3. The server handles the request and returns an HTML response;
4. The browser parses and renders the HTML and displays the page.
What a web crawler does, simply put, is reproduce this browser behavior: given a URL, it returns the data the user needs directly, without a human driving the browser step by step.
For the fetching step, be clear about what you want back: the HTML source, a JSON-formatted string, or something else.
1. The most basic fetching
Most fetching is a GET request, i.e., pulling data straight from the target server.
First of all, Python ships with urllib and urllib2, which are enough for ordinary page fetching. Beyond that, requests is a very useful package, and there are similar ones such as httplib2.
<code class="language-text">Requests:
import requests
response = requests.get(url)
content = response.content
print "response headers:", response.headers
print "content:", content
Urllib2:
import urllib2
response = urllib2.urlopen(url)
content = response.read()
print "response headers:", response.headers
print "content:", content
Httplib2:
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers
print "content:", content
</code>
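As a concrete illustration of the HTML-versus-JSON distinction above, here is a minimal sketch with requests (both URLs are placeholders, not real endpoints): an HTML page is handled as raw markup, while a JSON API response can be decoded straight into Python objects.
<code class="language-python">import requests

# HTML page: work with the raw markup
html = requests.get('http://example.com/page').text

# JSON endpoint (e.g. an AJAX API): decode directly into Python objects
data = requests.get('http://example.com/api/items').json()
print(data)
</code>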
Isn't crawling just cosplay? You dress your program up as a browser. Sometimes the site makes the costume harder to wear, for example by shipping packed/obfuscated JavaScript like this:
<code class="language-js">eval(function(p, a, c, k, e, d) {
    e = function(c) {
        return (c < a ? "" : e(parseInt(c / a))) + ((c = c % a) > 35 ? String.fromCharCode(c + 29) : c.toString(36))
    };
    if (!''.replace(/^/, String)) {
        while (c--)
            d[e(c)] = k[c] || e(c);
        k = [function(e) {
            return d[e]
        }];
        e = function() {
            return '\\w+'
        };
        c = 1;
    };
    while (c--)
        if (k[c])
            p = p.replace(new RegExp('\\b' + e(c) + '\\b', 'g'), k[c]);
    return p;
}('m 5$=[\'\',\'b\',\'f\',\'e\',\'h\'],l,7;g(6[5$[1]]){l=7.9(5$[0]);c=l.8(d,a);l.8(i,j,c);6[5$[4]]=6[5$[1]](6[5$[2]]=6[5$[2]][5$[3]](7,l.k(5$[0])))}', 23, 23, '|||||_|w|r|splice|split|0x1|simpleLoader||y|replace|condition|if|flightLoader|x|0x0|join||var'.split('|'), 0, {}))
</code>
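The cheapest part of the costume is the request headers. A minimal sketch with requests might look like the following (the User-Agent string and URL are placeholders, not taken from the original answer):
<code class="language-python">import requests

# Dress up as a normal desktop browser; many sites reject the default
# python-requests User-Agent or expect a Referer and cookies.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # placeholder UA
    'Referer': 'http://example.com/',
}

session = requests.Session()   # a Session keeps cookies across requests
response = session.get('http://example.com/page', headers=headers)
print(response.status_code)
</code>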
Judging from the description, I would guess you have already followed some online tutorials and blog posts to scrape simple content, and when you tried to go deeper you found that more advanced learning resources are harder to come by. You can scrape Tieba images (probably the images under a single thread rather than the whole forum, as most tutorials do), use multiple threads, scrape proxy IPs, and have some Scrapy experience. Here is a bit of advice on what to do next (the question was asked over half a year ago, so you probably don't need it anymore ^_^), for reference only.

Most web pages today support gzip compression, which can save a lot of transfer time. Take the VeryCD homepage as an example: uncompressed it is 247 KB, compressed it is 45 KB, roughly a fifth of the original, which means fetching is roughly five times faster.
However, Python's urllib/urllib2 do not support compression by default. To get a compressed response you have to set 'Accept-Encoding' in the request headers, and after reading the response you also have to check the 'Content-Encoding' header to decide whether it needs decoding, which is tedious. So how do we make urllib2 handle gzip and deflate automatically?
You can subclass BaseHandler and install it with build_opener:
<code class="language-text">import urllib2 from gzip import GzipFile from StringIO import StringIO class ContentEncodingProcessor(urllib2.BaseHandler): """A handler to add gzip capabilities to urllib2 requests """ # add headers to requests def http_request(self, req): req.add_header("Accept-Encoding", "gzip, deflate") return req # decode def http_response(self, req, resp): old_resp = resp # gzip if resp.headers.get("content-encoding") == "gzip": gz = GzipFile( fileobj=StringIO(resp.read()), mode="r" ) resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code) resp.msg = old_resp.msg # deflate if resp.headers.get("content-encoding") == "deflate": gz = StringIO( deflate(resp.read()) ) resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code) # 'class to add info() and resp.msg = old_resp.msg return resp # deflate support import zlib def deflate(data): # zlib only provides the zlib compress format, not the deflate format; try: # so on top of all there's this workaround: return zlib.decompress(data, -zlib.MAX_WBITS) except zlib.error: return zlib.decompress(data)
</code>
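For completeness, installing the handler via build_opener might look like this (a minimal sketch based on the snippet above; the URL is a placeholder). If you use requests instead, it already sends Accept-Encoding and transparently decompresses gzip/deflate responses, so none of this plumbing is needed.
<code class="language-python"># Every request made through this opener now advertises and
# transparently decodes gzip/deflate.
encoding_support = ContentEncodingProcessor
opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)

url = 'http://www.example.com/'
content = opener.open(url).read()   # already decompressed
</code>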
I worked as a crawler freelancer for a while and took on five projects. "About 10 requests per second will do."
This requirement looks very simple, but I believe a small number of programmers would still fail to meet it on a single machine. Here is how to hit it on an ordinary PC. A quick test (Aliyun server, 1 Mbps bandwidth): time curl http://www.baidu.com
It takes 0.454 seconds. At roughly half a second per request, a sequential loop tops out at about 2 requests per second, so you need concurrency — for example, a small thread pool:
<code class="language-python3">import queue
import threading
import logging

class WPooler:
    def __init__(self, maxsize=1024, concurrent=4):
        self.queue = queue.Queue(maxsize=maxsize)
        self.concurrent = concurrent
        self.threads = []
        for i in range(self.concurrent):
            self.threads.append(WThreader(self.queue))
        for thread in self.threads:
            thread.start()

    def do(self, func, args=(), kwargs={}, callback=None):
        # enqueue a job; a worker thread will run func(*args, **kwargs)
        self.queue.put((func, args, kwargs, callback))

    def async(self, callback=None):
        # Asyncer is a helper defined elsewhere in the original answer (not shown here).
        # Note: 'async' became a reserved word in Python 3.7+, so rename this
        # method if you target a newer interpreter.
        return Asyncer(self, callback=callback)

    def wait(self):
        self.queue.join()

class WThreader(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self, daemon=True)
        self.queue = queue

    def run(self):
        while True:
            func, args, kwargs, callback = self.queue.get(block=True)
            try:
                r = func(*args, **kwargs)
                if callback:
                    callback(r)
            except Exception as e:
                logging.exception(e)
            finally:
                self.queue.task_done()
</code>
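A hypothetical usage sketch (the URL list and the handle callback are made up for illustration): with concurrent=10 workers and about 0.45 s per request, the pool should sustain well over 10 requests per second.
<code class="language-python">import requests

def handle(response):
    # process/store the fetched page here
    print(response.status_code, len(response.content))

pool = WPooler(concurrent=10)
for url in ['http://www.baidu.com'] * 100:   # placeholder URL list
    pool.do(requests.get, args=(url,), callback=handle)
pool.wait()   # block until the queue is drained
</code>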
From getting started to mastery: 如何入门 Python 爬虫? - 谢科的回答 (How to get started with Python crawlers? - 谢科's answer)
A crawler is an effective way to obtain data, hack-style, when there is no API to get it from; getting better at it means gradually moving from simple pages to complex ones. Depending on what you need and what kind of site you are scraping, different Python libraries can be combined to grab the data quickly. But whatever libraries you use, the first step is always to analyze the target page and find the pattern behind it: some crawlers loop over a fixed URL prefix with different suffixes; some start from a seed URL and recursively discover more target URLs; some pages are static and can be fetched directly, while others render their data with JavaScript and require constructing a second request... Writing all of this down would take more than one article, so here are a few typical examples:
1. Page URLs are a fixed prefix concatenated with different suffixes:
Take scraping book-category information from the OPENISBN site. I have a batch of books to import, but their metadata is incomplete — the category is missing, for instance — so I look it up on "http://openisbn.com/" by ISBN. For the book 《失控》 (Out of Control), ISBN 7513300712, the URL is "http://openisbn.com/isbn/7513300712/": the pattern is simply the fixed prefix "http://openisbn.com/isbn/" followed by the ISBN. Then inspect the page elements (Chrome: right click -> Inspect):
First, I use urllib2 + re directly to grab the "Category:" information:
<code class="language-python"><span class="c">#-*- coding:UTF-8 -*-</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">urllib2</span>
<span class="n">isbn</span> <span class="o">=</span> <span class="s">'7513300712'</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">'http://openisbn.com/isbn/{0}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">isbn</span><span class="p">)</span>
<span class="n">category_pattern</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">r'Category: *.*, '</span><span class="p">)</span>
<span class="n">html</span> <span class="o">=</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">category_info</span> <span class="o">=</span> <span class="n">category_pattern</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">html</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">category_info</span><span class="p">)</span> <span class="o">></span> <span class="mi">0</span> <span class="p">:</span>
<span class="k">print</span> <span class="n">category_info</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">print</span> <span class="s">'get category failed.'</span>
</code>
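The paragraph above also mentions a second pattern: starting from a seed URL and recursively collecting more target URLs. A minimal sketch of that idea, staying with urllib2 as in the example above (the seed URL, depth limit and naive href regex are illustrative assumptions, not part of the original answer):
<code class="language-python">#-*- coding:UTF-8 -*-
# Breadth-first crawl: every link found on a fetched page becomes a new
# target until max_depth is reached.
import re
import urllib2
from collections import deque

link_pattern = re.compile(r'href="(http[^"]+)"')

def crawl(seed_url, max_depth=2):
    seen = set([seed_url])
    pending = deque([(seed_url, 0)])
    while pending:
        url, depth = pending.popleft()
        try:
            html = urllib2.urlopen(url, timeout=10).read()
        except Exception:
            continue
        # ... parse and store the page content here ...
        if depth < max_depth:
            for link in link_pattern.findall(html):
                if link not in seen:
                    seen.add(link)
                    pending.append((link, depth + 1))

crawl('http://openisbn.com/')   # placeholder seed URL
</code>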
<code class="language-text">http://weixin.sogou.com/gzh?openid=oIWsFt4ORWCSUS8szIwVLoRuAq9M
</code>