
Automatically Obtaining Cookies in a Web Crawler and Refreshing Them When They Expire (Detailed Tutorial)

亚连 | Original | 2018-06-01 10:02:09 | 7858 views

This article shows how a web crawler can obtain cookies automatically and renew them automatically once they expire. Readers who need this can use it as a reference.

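Much of the information on social networking sites is available only after logging in. Take Weibo as an example: without logging in, you can see only the first ten posts of a verified high-profile account (a "big V"). Staying logged in requires cookies. Take logging in to www.weibo.cn as an example:

In Chrome, open: http://login.weibo.cn/login/

In the developer tools, inspect the Headers of the request and its response; you will see that weibo.cn returns several cookies.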
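To make concrete what those response headers carry, here is a minimal sketch that parses Set-Cookie header values with Python's standard http.cookies module. The header strings and the cookie names token_a and token_b are made up for illustration; they are not what weibo.cn actually returns, and parse_set_cookie_headers is a hypothetical helper.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values, one per cookie (illustration only).
SAMPLE_HEADERS = [
    "token_a=abc123; Path=/; Domain=.weibo.cn; Max-Age=3600",
    "token_b=xyz789; Path=/; Domain=.weibo.cn; Max-Age=86400",
]


def parse_set_cookie_headers(header_values):
    """Return {name: {"value": ..., "domain": ..., "max_age": ...}}."""
    parsed = {}
    for header in header_values:
        jar = SimpleCookie()
        jar.load(header)  # attributes such as Path attach to the preceding cookie
        for name, morsel in jar.items():
            max_age = morsel["max-age"]
            parsed[name] = {
                "value": morsel.value,
                "domain": morsel["domain"],
                "max_age": int(max_age) if max_age else None,
            }
    return parsed


if __name__ == "__main__":
    for name, info in parse_set_cookie_headers(SAMPLE_HEADERS).items():
        print(name, info)
```

Each cookie carries its own lifetime, which is why the later steps check expiry cookie by cookie.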

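Implementation steps:

1. Log in automatically with selenium to obtain the cookies, and save them to files;

2. Read the cookies and check their expiry times; if they have expired, run step 1 again;

3. Attach the cookies when requesting other pages, so that the logged-in state is kept.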

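1. Obtaining the cookies online

Use selenium + PhantomJS to simulate a browser login and obtain the cookies.

There are usually several cookies; each one is saved to its own file with the .weibo suffix.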

import pickle

from selenium import webdriver


def get_cookie_from_network():
    # Written against the Selenium 3.x API: PhantomJS support and the
    # find_element_by_* helpers were removed in the Selenium 4 releases.
    url_login = 'http://login.weibo.cn/login/'
    driver = webdriver.PhantomJS()
    driver.get(url_login)
    driver.find_element_by_xpath('//input[@type="text"]').send_keys('your_weibo_account')  # replace with your Weibo account
    driver.find_element_by_xpath('//input[@type="password"]').send_keys('your_weibo_password')  # replace with your Weibo password
    driver.find_element_by_xpath('//input[@type="submit"]').click()  # click the login button
    # Collect the cookies set during login
    cookie_list = driver.get_cookies()
    print(cookie_list)
    driver.quit()
    cookie_dict = {}
    for cookie in cookie_list:
        # Write each cookie to its own file; pickle needs binary mode
        with open(cookie['name'] + '.weibo', 'wb') as f:
            pickle.dump(cookie, f)
        if 'name' in cookie and 'value' in cookie:
            cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict
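The storage half of this step can be tried without launching a browser. The sketch below is an assumption-laden illustration: save_cookies is a hypothetical helper, and the hand-written cookie list only imitates the shape that selenium's get_cookies() returns (dicts with keys such as name, value and expiry). The cookie names and values are invented.

```python
import os
import pickle
import tempfile


def save_cookies(cookie_list, directory):
    """Pickle each cookie dict to <directory>/<name>.weibo; return {name: value}."""
    cookie_dict = {}
    for cookie in cookie_list:
        if 'name' not in cookie or 'value' not in cookie:
            continue  # skip malformed entries instead of raising KeyError
        path = os.path.join(directory, cookie['name'] + '.weibo')
        with open(path, 'wb') as f:  # pickle requires binary mode
            pickle.dump(cookie, f)
        cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict


if __name__ == '__main__':
    # Invented cookies in the shape of selenium's get_cookies() output.
    fake_cookies = [
        {'name': 'token_a', 'value': 'abc', 'expiry': 2000000000},
        {'name': 'token_b', 'value': 'xyz', 'expiry': 2000000000},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        print(save_cookies(fake_cookies, tmp))
        print(sorted(os.listdir(tmp)))
```

Keeping one file per cookie means each file records its own expiry, which the next step relies on.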

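2. Reading the cookies from files

Walk the current directory for files ending in .weibo, which are the cookie files. Unpickle each one into a dict and compare its expiry value with the current time; if any cookie has expired, return an empty dict.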

import os
import pickle
import time


def get_cookie_from_cache():
    cookie_dict = {}
    for parent, dirnames, filenames in os.walk('./'):
        for filename in filenames:
            if filename.endswith('.weibo'):
                print(filename)
                # Open relative to the directory being walked; pickle needs binary mode
                with open(os.path.join(parent, filename), 'rb') as f:
                    d = pickle.load(f)
                if 'name' in d and 'value' in d and 'expiry' in d:
                    expiry_date = int(d['expiry'])
                    if expiry_date > int(time.time()):
                        cookie_dict[d['name']] = d['value']
                    else:
                        # One expired cookie invalidates the whole cache
                        return {}
    return cookie_dict
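The expiry comparison can be pulled out into a small pure function, which makes it easy to check without touching the file system. In this sketch is_expired and valid_cookie_dict are hypothetical helpers. Treating cookies that have no expiry key (session cookies) as still valid is an assumption made here; the function above skips such cookies instead.

```python
import time


def is_expired(cookie, now=None):
    """True if the cookie's 'expiry' (Unix seconds) is at or before `now`."""
    if now is None:
        now = time.time()
    expiry = cookie.get('expiry')
    if expiry is None:
        return False  # assumption: session cookies carry no expiry and stay valid
    return int(expiry) <= int(now)


def valid_cookie_dict(cookies, now=None):
    """Return {name: value}, or {} as soon as any cookie has expired."""
    result = {}
    for cookie in cookies:
        if is_expired(cookie, now):
            return {}
        result[cookie['name']] = cookie['value']
    return result


if __name__ == '__main__':
    now = 1600000000  # fixed timestamp so the demo is reproducible
    fresh = {'name': 'token_a', 'value': 'abc', 'expiry': now + 60}
    stale = {'name': 'token_b', 'value': 'xyz', 'expiry': now - 60}
    print(valid_cookie_dict([fresh], now))
    print(valid_cookie_dict([fresh, stale], now))
```

Passing now explicitly keeps the check deterministic; leaving it out falls back to the current time.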

3. If the cached cookies have expired, fetch them from the network again

def get_cookie():
 cookie_dict = get_cookie_from_cache()
 if not cookie_dict:
  cookie_dict = get_cookie_from_network()
 return cookie_dict

4. Requesting other Weibo pages with the cookies

import requests
from bs4 import BeautifulSoup as bs


def get_weibo_list(user_id):
    cookdic = get_cookie()
    url = 'http://weibo.cn/stocknews88'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
    timeout = 5
    r = requests.get(url, headers=headers, cookies=cookdic, timeout=timeout)
    soup = bs(r.text, 'lxml')
    ...
    # Parse the page with BeautifulSoup
    ...
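On the wire, passing cookies to requests.get results in the name/value pairs being sent in a single Cookie request header. The sketch below builds that header by hand with the standard library and does not send anything. build_cookie_header and build_request are hypothetical helpers, and the cookie names and values are placeholders.

```python
import urllib.request


def build_cookie_header(cookie_dict):
    """Join {name: value} pairs into a Cookie header value such as 'a=1; b=2'."""
    return '; '.join('{}={}'.format(name, value) for name, value in cookie_dict.items())


def build_request(url, cookie_dict, user_agent='Mozilla/5.0'):
    """Create (but do not send) a urllib Request carrying the cookies."""
    headers = {
        'User-Agent': user_agent,
        'Cookie': build_cookie_header(cookie_dict),
    }
    return urllib.request.Request(url, headers=headers)


if __name__ == '__main__':
    # Placeholder cookies; real ones would come from get_cookie().
    req = build_request('http://weibo.cn/stocknews88', {'token_a': 'abc', 'token_b': 'xyz'})
    print(req.get_header('Cookie'))
```

Printing the header this way is a quick check that the cached cookies are actually attached before a real request is made.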

That is what I have put together for you; I hope it will be useful to you in the future.
