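This article walks through an example of giving a Pyspider crawler randomized, forged request headers. It is shared here as a reference for anyone who needs it.
Pyspider uses the tornado library to make HTTP requests, and a request can carry various parameters, such as the connection timeout, the data-transfer timeout, and the request headers. In pyspider's stock setup, however, crawler parameters are set through the crawl_config Python dictionary (shown below); the framework converts the entries in this dictionary into task data and then issues the HTTP request. The drawback of this parameter is that it applies to the whole crawler, so it is inconvenient for giving each individual request a random set of headers.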
crawl_config = {
    "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "timeout": 120,
    "connect_timeout": 60,
    "retries": 5,
    "fetch_type": 'js',
    "auto_recrawl": True,
}
Here is a way to add random request headers to the crawler:
1. Write a script, place it in pyspider's libs folder, and name it header_switch.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Created on 2017-10-18 11:52:26

import random


class HeadersSelector(object):
    """
    The header sets below leave out two fields: Host and Cookie.
    """

    # Browser headers found online
    headers_1 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "DNT": "1",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Referer": "https://www.baidu.com/s?wd=%BC%96%E7%A0%81&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&oq=If-None-Match&inputT=7282&rsv_t",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }

    # Browser on Windows 7
    headers_2 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnPAvZN",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    }

    # Firefox on Linux
    headers_3 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
        "Accept": "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=http%B4%20Pragma&rsf=1&rsp=4&f=1&oq=Pragma&tn=baiduhome_pg&ie=utf-8&usm=3&rsv_idx=2&rsv_pq=e9bd5e5000010",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
    }

    # Firefox on Windows 10
    headers_4 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnP",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
    }

    # Chrome on Windows 10 (the User-Agent string also carries an Edge token)
    headers_5 = {
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64;) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }

    # Browser on Windows 10
    headers_6 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=If-None-Match&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rq",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    }

    def select_header(self):
        # Pick one of the six header sets at random
        n = random.randint(1, 6)
        switch = {
            1: self.headers_1,
            2: self.headers_2,
            3: self.headers_3,
            4: self.headers_4,
            5: self.headers_5,
            6: self.headers_6,
        }
        # Return a copy so that callers adding Host or Cookie
        # do not modify the shared class-level dictionaries
        headers = dict(switch[n])
        return headers
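Only six header sets are written here. If the crawl volume is very large, you can write many more, even a hundred or more, and then widen the range passed to random so that it selects among all of them.
2. In the pyspider script, write the following code: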
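Once the pool grows, numbering each dictionary by hand becomes tedious. The following is a minimal sketch of one alternative, assuming it is acceptable to share a set of base fields and vary only the User-Agent: keep the User-Agent strings in a list and let random.choice pick one, so no range needs updating when entries are added. The names BASE_HEADERS, USER_AGENTS, and pick_headers are hypothetical and are not part of pyspider or of the script above.

```python
import random

# Fields shared by every generated header set (values reused from the script above)
BASE_HEADERS = {
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
}

# Extend this list freely; random.choice adapts to its length
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
]


def pick_headers():
    """Return a new header dict with a randomly chosen User-Agent."""
    headers = dict(BASE_HEADERS)  # copy, so the shared base is never modified
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return headers
```

The trade-off is that only the User-Agent varies; if the other fields should differ per browser as well, a list of complete dictionaries passed to random.choice would work the same way.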
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-08-18 11:52:26

import sys

from pyspider.libs.base_handler import *
# Import path follows step 1: header_switch.py saved in pyspider's libs folder
from pyspider.libs.header_switch import HeadersSelector

# Python 2 only: force the default string encoding to UTF-8
defaultencoding = 'utf-8'
if sys.getdefaultencoding() != defaultencoding:
    reload(sys)
    sys.setdefaultencoding(defaultencoding)


class Handler(BaseHandler):
    crawl_config = {
        "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "timeout": 120,
        "connect_timeout": 60,
        "retries": 5,
        "fetch_type": 'js',
        "auto_recrawl": True,
    }

    @every(minutes=24 * 60)
    def on_start(self):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a fresh header set
        # header["X-Requested-With"] = "XMLHttpRequest"
        orig_href = 'http://sww.bjxch.gov.cn/gggs.html'
        # The headers must be passed to crawl(); cookies can be read from response.cookies
        self.crawl(orig_href, callback=self.index_page, headers=header)

    @config(age=24 * 60 * 60)
    def index_page(self, response):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a fresh header set
        # header["X-Requested-With"] = "XMLHttpRequest"
        if response.cookies:
            # The request header is named "Cookie" and its value must be a string
            header["Cookie"] = "; ".join("%s=%s" % (k, v) for k, v in response.cookies.items())
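The most important point is that every callback (on_start, index_page, and so on) instantiates a header selector each time it is called, so that each request gets a different set of headers. Pay attention to the following added code: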
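header_slt = HeadersSelector()
header = header_slt.select_header()  # get a fresh header set
# header["X-Requested-With"] = "XMLHttpRequest"
header["Host"] = "www.baidu.com"
if response.cookies:
    # The request header is named "Cookie" and its value must be a string
    header["Cookie"] = "; ".join("%s=%s" % (k, v) for k, v in response.cookies.items())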
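An AJAX request sent through XHR usually carries the X-Requested-With header, and servers often use it to decide whether a request is an Ajax request. For such pages, the headers must include {'X-Requested-With': 'XMLHttpRequest'} before the content can be fetched.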
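A minimal sketch of that step, kept independent of pyspider so it runs on its own. The helper name as_ajax is hypothetical; it copies the header dictionary before adding the field, so the original set can still be used for ordinary page requests.

```python
def as_ajax(headers):
    """Return a copy of headers that marks the request as an XHR call."""
    ajax_headers = dict(headers)  # leave the caller's dict untouched
    ajax_headers["X-Requested-With"] = "XMLHttpRequest"
    return ajax_headers


page_headers = {"Accept": "*/*"}
ajax_headers = as_ajax(page_headers)
```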
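Once the URL is fixed, the Host field of the request headers is fixed as well, so add it where needed. The urlparse module (urllib.parse in Python 3) can extract the host from a URL: call urlparse on the URL and read the netloc attribute of the result.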
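As a short illustration of the netloc lookup, the sketch below derives Host from the start URL used in the handler. The helper name host_for is hypothetical, and the try/except import is only there so the same snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def host_for(url):
    """Return the network location (host, plus port if present) of a URL."""
    return urlparse(url).netloc


url = "http://sww.bjxch.gov.cn/gggs.html"
headers = {"Accept": "*/*"}
headers["Host"] = host_for(url)
```

Note that netloc includes the port when the URL specifies one, which is also what a Host header is expected to carry for non-default ports.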
If the response carries cookies, add them to the request headers.
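A sketch of that step, assuming the cookies arrive as a plain name-to-value dictionary (the assumption made here about response.cookies). The helper name cookie_header is hypothetical. It joins the pairs into the single string that a Cookie request header expects.

```python
def cookie_header(cookies):
    """Serialize a name-to-value dict into a Cookie header value."""
    # Sorting only makes the output deterministic; servers do not require an order
    return "; ".join("%s=%s" % (name, cookies[name]) for name in sorted(cookies))


headers = {"Accept": "*/*"}
cookies = {"sessionid": "abc123", "lang": "zh"}  # stand-in for response.cookies
if cookies:
    headers["Cookie"] = cookie_header(cookies)
```

pyspider's self.crawl also accepts a cookies argument that takes a dictionary, which may be a simpler route than building the header by hand.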
If you have other disguise requirements, add them yourself.
That is all it takes to get random request headers.