search
HomeBackend DevelopmentPython TutorialPython uses the Srapy framework crawler to simulate login and crawl Zhihu content

1. Cookie Principle
HTTP is a stateless connection-oriented protocol. In order to maintain the connection state, a Cookie mechanism is introduced
Cookie is an attribute in the http message header, including:

  • Cookie name (Name) Cookie value (Value)
  • Cookie expiration time (Expires/Max-Age)
  • Cookie action path (Path)
  • The domain name where the cookie is located (Domain), use cookies for secure connection (Secure)

The first two parameters are necessary conditions for cookie application. In addition, they also include Cookie size (Size, different browsers have different restrictions on the number and size of Cookies).

2. Simulated login
The main website crawled this time is Zhihu
You need to log in to crawl Zhihu. Form submission can be easily implemented through the previous python built-in library.

Now let’s take a look at how to implement form submission through Scrapy.

First check the form results when logging in. It is still the same as the technique used before. I deliberately entered the wrong password and captured the login web page header and form (I used the Network function in the developer tools that comes with Chrome)

201672182940777.png (702×170)

Looking at the captured form, you can see that it has four parts:

  • The email and password are the email and password for personal login
  • The rememberme field indicates whether to remember the account
  • The first field is _xsrf, which is guessed to be a verification mechanism
  • Now only _xsrf doesn’t know. I guess this verification field will definitely be sent when requesting the web page, so let’s check the source code of the current web page (right-click the mouse and view the web page source code, or use the shortcut key directly)

201672183128262.png (1788×782)

Find out our guess was correct

Then now you can write the form login function

def start_requests(self):
    return [Request("https://www.zhihu.com/login", callback = self.post_login)] #重写了爬虫类的方法, 实现了自定义请求, 运行成功后会调用callback回调函数

  #FormRequeset
  def post_login(self, response):
    print 'Preparing login'
    #下面这句话用于抓取请求网页后返回网页中的_xsrf字段的文字, 用于成功提交表单
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print xsrf
    #FormRequeset.from_response是Scrapy提供的一个函数, 用于post表单
    #登陆成功后, 会调用after_login回调函数
    return [FormRequest.from_response(response,  
              formdata = {
              '_xsrf': xsrf,
              'email': '123456',
              'password': '123456'
              },
              callback = self.after_login
              )]

The main functions are explained in the comments of the function
3. Saving Cookies
In order to continuously crawl the website using the same state, you need to save cookies and use cookies to save the state. Scrapy provides cookie processing middleware, which can be used directly

CookiesMiddleware:

This cookie middleware saves and tracks the cookie sent by the web server, and sends this cookie on the next request
The official Scrapy documentation gives the following code example:

for i, url in enumerate(urls):
  yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
    callback=self.parse_page)

def parse_page(self, response):
  # do some processing
  return scrapy.Request("http://www.example.com/otherpage",
    meta={'cookiejar': response.meta['cookiejar']},
    callback=self.parse_other_page)

Then we can modify the method in our crawler class to make it track cookies

  #重写了爬虫类的方法, 实现了自定义请求, 运行成功后会调用callback回调函数
  def start_requests(self):
    return [Request("https://www.zhihu.com/login", meta = {'cookiejar' : 1}, callback = self.post_login)] #添加了meta

  #FormRequeset出问题了
  def post_login(self, response):
    print 'Preparing login'
    #下面这句话用于抓取请求网页后返回网页中的_xsrf字段的文字, 用于成功提交表单
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print xsrf
    #FormRequeset.from_response是Scrapy提供的一个函数, 用于post表单
    #登陆成功后, 会调用after_login回调函数
    return [FormRequest.from_response(response,  #"http://www.zhihu.com/login",
              meta = {'cookiejar' : response.meta['cookiejar']}, #注意这里cookie的获取
              headers = self.headers,
              formdata = {
              '_xsrf': xsrf,
              'email': '123456',
              'password': '123456'
              },
              callback = self.after_login,
              dont_filter = True
              )]

4. Disguise the head
Sometimes logging into a website requires header disguise, such as adding an anti-leeching header, or simulating server login

201672183151347.png (2136×604)

For insurance, we can fill in more fields in the header, as follows

  headers = {
  "Accept": "*/*",
  "Accept-Encoding": "gzip,deflate",
  "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
  "Connection": "keep-alive",
  "Content-Type":" application/x-www-form-urlencoded; charset=UTF-8",
  "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
  "Referer": "http://www.zhihu.com/"
  }

In scrapy, both Request and FormRequest have a headers field when they are initialized. The headers can be customized, so we can add the headers field

Form the final version of the login function

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from zhihu.items import ZhihuItem



class ZhihuSipder(CrawlSpider) :
  name = "zhihu"
  allowed_domains = ["www.zhihu.com"]
  start_urls = [
    "http://www.zhihu.com"
  ]
  rules = (
    Rule(SgmlLinkExtractor(allow = ('/question/\d+#.*?', )), callback = 'parse_page', follow = True),
    Rule(SgmlLinkExtractor(allow = ('/question/\d+', )), callback = 'parse_page', follow = True),
  )
  headers = {
  "Accept": "*/*",
  "Accept-Encoding": "gzip,deflate",
  "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
  "Connection": "keep-alive",
  "Content-Type":" application/x-www-form-urlencoded; charset=UTF-8",
  "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
  "Referer": "http://www.zhihu.com/"
  }

  #重写了爬虫类的方法, 实现了自定义请求, 运行成功后会调用callback回调函数
  def start_requests(self):
    return [Request("https://www.zhihu.com/login", meta = {'cookiejar' : 1}, callback = self.post_login)]

  #FormRequeset出问题了
  def post_login(self, response):
    print 'Preparing login'
    #下面这句话用于抓取请求网页后返回网页中的_xsrf字段的文字, 用于成功提交表单
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print xsrf
    #FormRequeset.from_response是Scrapy提供的一个函数, 用于post表单
    #登陆成功后, 会调用after_login回调函数
    return [FormRequest.from_response(response,  #"http://www.zhihu.com/login",
              meta = {'cookiejar' : response.meta['cookiejar']},
              headers = self.headers, #注意此处的headers
              formdata = {
              '_xsrf': xsrf,
              'email': '1095511864@qq.com',
              'password': '123456'
              },
              callback = self.after_login,
              dont_filter = True
              )]

  def after_login(self, response) :
    for url in self.start_urls :
      yield self.make_requests_from_url(url)

  def parse_page(self, response):
    problem = Selector(response)
    item = ZhihuItem()
    item['url'] = response.url
    item['name'] = problem.xpath('//span[@class="name"]/text()').extract()
    print item['name']
    item['title'] = problem.xpath('//h2[@class="zm-item-title zm-editable-content"]/text()').extract()
    item['description'] = problem.xpath('//div[@class="zm-editable-content"]/text()').extract()
    item['answer']= problem.xpath('//div[@class=" zm-editable-content clearfix"]/text()').extract()
    return item

5. Item class and crawl interval
Complete Zhihu crawler code link

from scrapy.item import Item, Field


class ZhihuItem(Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  url = Field() #保存抓取问题的url
  title = Field() #抓取问题的标题
  description = Field() #抓取问题的描述
  answer = Field() #抓取问题的答案
  name = Field() #个人用户的名称

Set the crawl interval. If the crawler crawls too quickly during the visit, the crawler mechanism of the website will be triggered. Set

in setting.py.
BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'
DOWNLOAD_DELAY = 0.25  #设置下载间隔为250ms

For more settings, please view the official documentation

Catch the results (only a small part of them)

...
 'url': 'http://www.zhihu.com/question/20688855/answer/16577390'}
2014-12-19 23:24:15+0800 [zhihu] DEBUG: Crawled (200) <GET http://www.zhihu.com/question/20688855/answer/15861368> (referer: http://www.zhihu.com/question/20688855/answer/19231794)
[]
2014-12-19 23:24:15+0800 [zhihu] DEBUG: Scraped from <200 http://www.zhihu.com/question/20688855/answer/15861368>
  {'answer': [u'\u9009\u4f1a\u8ba1\u8fd9\u4e2a\u4e13\u4e1a\uff0c\u8003CPA\uff0c\u5165\u8d22\u52a1\u8fd9\u4e2a\u884c\u5f53\u3002\u8fd9\u4e00\u8def\u8d70\u4e0b\u6765\uff0c\u6211\u53ef\u4ee5\u5f88\u80af\u5b9a\u7684\u544a\u8bc9\u4f60\uff0c\u6211\u662f\u771f\u7684\u559c\u6b22\u8d22\u52a1\uff0c\u70ed\u7231\u8fd9\u4e2a\u884c\u4e1a\uff0c\u56e0\u6b64\u575a\u5b9a\u4e0d\u79fb\u5730\u5728\u8fd9\u4e2a\u884c\u4e1a\u4e2d\u8d70\u4e0b\u53bb\u3002',
        u'\u4e0d\u8fc7\u4f60\u8bf4\u6709\u4eba\u4ece\u5c0f\u5c31\u559c\u6b22\u8d22\u52a1\u5417\uff1f\u6211\u89c9\u5f97\u51e0\u4e4e\u6ca1\u6709\u5427\u3002\u8d22\u52a1\u7684\u9b45\u529b\u5728\u4e8e\u4f60\u771f\u6b63\u61c2\u5f97\u5b83\u4e4b\u540e\u3002',
        u'\u901a\u8fc7\u5b83\uff0c\u4f60\u53ef\u4ee5\u5b66\u4e60\u4efb\u4f55\u4e00\u79cd\u5546\u4e1a\u7684\u7ecf\u8425\u8fc7\u7a0b\uff0c\u4e86\u89e3\u5176\u7eb7\u7e41\u5916\u8868\u4e0b\u7684\u5b9e\u7269\u6d41\u3001\u73b0\u91d1\u6d41\uff0c\u751a\u81f3\u4f60\u53ef\u4ee5\u638c\u63e1\u5982\u4f55\u53bb\u7ecf\u8425\u8fd9\u79cd\u5546\u4e1a\u3002',
        u'\u5982\u679c\u5bf9\u4f1a\u8ba1\u7684\u8ba4\u8bc6\u4ec5\u4ec5\u505c\u7559\u5728\u505a\u5206\u5f55\u8fd9\u4e2a\u5c42\u9762\uff0c\u5f53\u7136\u4f1a\u89c9\u5f97\u67af\u71e5\u65e0\u5473\u3002\u5f53\u4f60\u5bf9\u5b83\u7684\u8ba4\u8bc6\u8fdb\u5165\u5230\u6df1\u5c42\u6b21\u7684\u65f6\u5019\uff0c\u4f60\u81ea\u7136\u5c31\u4f1a\u559c\u6b22\u4e0a\u5b83\u4e86\u3002\n\n\n'],
   'description': [u'\u672c\u4eba\u5b66\u4f1a\u8ba1\u6559\u80b2\u4e13\u4e1a\uff0c\u6df1\u611f\u5176\u67af\u71e5\u4e4f\u5473\u3002\n\u5f53\u521d\u662f\u51b2\u7740\u5e08\u8303\u4e13\u4e1a\u62a5\u7684\uff0c\u56e0\u4e3a\u68a6\u60f3\u662f\u6210\u4e3a\u4e00\u540d\u8001\u5e08\uff0c\u4f46\u662f\u611f\u89c9\u73b0\u5728\u666e\u901a\u521d\u9ad8\u4e2d\u8001\u5e08\u5df2\u7ecf\u8d8b\u4e8e\u9971\u548c\uff0c\u800c\u987a\u6bcd\u4eb2\u5927\u4eba\u7684\u610f\u9009\u4e86\u8fd9\u4e2a\u4e13\u4e1a\u3002\u6211\u559c\u6b22\u4e0a\u6559\u80b2\u5b66\u7684\u8bfe\uff0c\u5e76\u597d\u7814\u7a76\u5404\u79cd\u6559\u80b2\u5fc3\u7406\u5b66\u3002\u4f46\u4f1a\u8ba1\u8bfe\u4f3c\u4e4e\u662f\u4e3b\u6d41\u3001\u54ce\u3002\n\n\u4e00\u76f4\u4e0d\u559c\u6b22\u94b1\u4e0d\u94b1\u7684\u4e13\u4e1a\uff0c\u6240\u4ee5\u5f88\u597d\u5947\u5927\u5bb6\u9009\u4f1a\u8ba1\u4e13\u4e1a\u5230\u5e95\u662f\u51fa\u4e8e\u4ec0\u4e48\u76ee\u7684\u3002\n\n\u6bd4\u5982\u8bf4\u5b66\u4e2d\u6587\u7684\u4f1a\u8bf4\u4ece\u5c0f\u559c\u6b22\u770b\u4e66\uff0c\u4f1a\u6709\u4ece\u5c0f\u559c\u6b22\u4f1a\u8ba1\u501f\u554a\u8d37\u554a\u7684\u7684\u4eba\u5417\uff1f'],
   'name': [],
   'title': [u'\n\n', u'\n\n'],
   'url': 'http://www.zhihu.com/question/20688855/answer/15861368'}
...

6. Problems

  • Rule design cannot achieve full website crawling, but only sets up the crawling of simple questions
  • The Xpath setting is not rigorous and needs to be rethought
  • Unicode encoding should be converted to UTF-8

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
详细讲解Python之Seaborn(数据可视化)详细讲解Python之Seaborn(数据可视化)Apr 21, 2022 pm 06:08 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于Seaborn的相关问题,包括了数据可视化处理的散点图、折线图、条形图等等内容,下面一起来看一下,希望对大家有帮助。

详细了解Python进程池与进程锁详细了解Python进程池与进程锁May 10, 2022 pm 06:11 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于进程池与进程锁的相关问题,包括进程池的创建模块,进程池函数等等内容,下面一起来看一下,希望对大家有帮助。

Python自动化实践之筛选简历Python自动化实践之筛选简历Jun 07, 2022 pm 06:59 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于简历筛选的相关问题,包括了定义 ReadDoc 类用以读取 word 文件以及定义 search_word 函数用以筛选的相关内容,下面一起来看一下,希望对大家有帮助。

归纳总结Python标准库归纳总结Python标准库May 03, 2022 am 09:00 AM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于标准库总结的相关问题,下面一起来看一下,希望对大家有帮助。

分享10款高效的VSCode插件,总有一款能够惊艳到你!!分享10款高效的VSCode插件,总有一款能够惊艳到你!!Mar 09, 2021 am 10:15 AM

VS Code的确是一款非常热门、有强大用户基础的一款开发工具。本文给大家介绍一下10款高效、好用的插件,能够让原本单薄的VS Code如虎添翼,开发效率顿时提升到一个新的阶段。

Python数据类型详解之字符串、数字Python数据类型详解之字符串、数字Apr 27, 2022 pm 07:27 PM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于数据类型之字符串、数字的相关问题,下面一起来看一下,希望对大家有帮助。

详细介绍python的numpy模块详细介绍python的numpy模块May 19, 2022 am 11:43 AM

本篇文章给大家带来了关于Python的相关知识,其中主要介绍了关于numpy模块的相关问题,Numpy是Numerical Python extensions的缩写,字面意思是Python数值计算扩展,下面一起来看一下,希望对大家有帮助。

python中文是什么意思python中文是什么意思Jun 24, 2019 pm 02:22 PM

pythn的中文意思是巨蟒、蟒蛇。1989年圣诞节期间,Guido van Rossum在家闲的没事干,为了跟朋友庆祝圣诞节,决定发明一种全新的脚本语言。他很喜欢一个肥皂剧叫Monty Python,所以便把这门语言叫做python。

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

Repo: How To Revive Teammates
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Atom editor mac version download

Atom editor mac version download

The most popular open source editor

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

SublimeText3 Linux new version

SublimeText3 Linux new version

SublimeText3 Linux latest version

VSCode Windows 64-bit Download

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

ZendStudio 13.5.1 Mac

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment