Home > Article > Backend Development > Python3 crawler brings cookies
The original meaning of cookie in English is "snack". It is the information stored by the server on the client's hard disk when the client accesses the Web server. It seems to be the "snack" sent by the server to the client. ". The server can track customer status based on cookies, which is particularly useful for occasions where customers need to be distinguished (such as e-commerce).
When the client requests access to the server for the first time, the server first stores a cookie containing the client's relevant information on the client. Every time the client requests access to the server in the future, the cookie will be included in the HTTP request data. The server By parsing the cookie in the HTTP request, you can obtain relevant information about the customer.
Let’s take a look at how to bring cookies to python3 crawlers:
1. Write Cookie directly in the header
# coding:utf-8 import requests from bs4 import BeautifulSoup cookie = '''cisession=19dfd70a27ec0eecf1fe3fc2e48b7f91c7c83c60;CNZZDATA1000201968=181584 6425-1478580135-https%253A%252F%252Fwww.baidu.com%252F%7C1483922031;Hm_lvt_f805f7762a9a2 37a0deac37015e9f6d9=1482722012,1483926313;Hm_lpvt_f805f7762a9a237a0deac37015e9f6d9=14839 26368''' header = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Geck o) Chrome/53.0.2785.143 Safari/537.36', 'Connection': 'keep-alive', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Cookie': cookie} url = 'https://www.jb51.net/article/191947.htm' wbdata = requests.get(url,headers=header).text soup = BeautifulSoup(wbdata,'lxml') print(soup)
2. Use requests Insert Cookie
# coding:utf-8 import requests from bs4 import BeautifulSoup cookie = { "cisession":"19dfd70a27ec0eecf1fe3fc2e48b7f91c7c83c60", "CNZZDATA100020196":"1815846425-1478580135-https%253A%252F%252Fwww.baidu.com%252F%7C1483 922031", "Hm_lvt_f805f7762a9a237a0deac37015e9f6d9":"1482722012,1483926313", "Hm_lpvt_f805f7762a9a237a0deac37015e9f6d9":"1483926368" } url = 'https://www.jb51.net/article/191947.htm' wbdata = requests.get(url,cookies=cookie).text soup = BeautifulSoup(wbdata,'lxml') print(soup)
Example extension:
Use cookies to log in to the Harbin Institute of Technology ACM site
Get the site login address
http:// acm.hit.edu.cn/hoj/system/login
View the post data to be transmitted
user and password
Code:
#!/usr/bin/env python # -*- coding: utf-8 -*- """ __author__ = 'pi' __email__ = 'pipisorry@126.com' """ import urllib.request, urllib.parse, urllib.error import http.cookiejar LOGIN_URL = 'http://acm.hit.edu.cn/hoj/system/login' values = {'user': '******', 'password': '******'} # , 'submit' : 'Login' postdata = urllib.parse.urlencode(values).encode() user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36' headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'} cookie_filename = 'cookie.txt' cookie = http.cookiejar.MozillaCookieJar(cookie_filename) handler = urllib.request.HTTPCookieProcessor(cookie) opener = urllib.request.build_opener(handler) request = urllib.request.Request(LOGIN_URL, postdata, headers) try: response = opener.open(request) page = response.read().decode() # print(page) except urllib.error.URLError as e: print(e.code, ':', e.reason) cookie.save(ignore_discard=True, ignore_expires=True) # 保存cookie到cookie.txt中 print(cookie) for item in cookie: print('Name = ' + item.name) print('Value = ' + item.value) get_url = 'http://acm.hit.edu.cn/hoj/problem/solution/?problem=1' # 利用cookie请求訪问还有一个网址 get_request = urllib.request.Request(get_url, headers=headers) get_response = opener.open(get_request) print(get_response.read().decode()) # print('You have not solved this problem' in get_response.read().decode())
Recommended tutorial: " Python Tutorial》
The above is the detailed content of Python3 crawler brings cookies. For more information, please follow other related articles on the PHP Chinese website!