Python uses the Scrapy framework to simulate login and crawl Zhihu content

1. Cookie Principle
HTTP is a stateless protocol (carried over connection-oriented TCP). To maintain state across requests, the Cookie mechanism was introduced.
A cookie is an attribute carried in HTTP message headers, and includes:

  • Cookie name (Name)
  • Cookie value (Value)
  • Expiration time (Expires/Max-Age)
  • Path the cookie applies to (Path)
  • Domain the cookie belongs to (Domain)
  • Whether the cookie requires a secure connection (Secure)

The first two (Name and Value) are required for a cookie to work at all. There is also a size attribute (Size); different browsers impose different limits on the number and size of cookies.
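As a quick illustration, here is a minimal sketch (with made-up values) that parses these attributes using Python 2's built-in Cookie module (http.cookies in Python 3):

from Cookie import SimpleCookie

# a made-up Set-Cookie header value, just for illustration
raw = 'session=abc123; Path=/; Domain=.zhihu.com; expires=Wed, 21-Oct-2015 07:28:00 GMT'
cookie = SimpleCookie()
cookie.load(raw)

morsel = cookie['session']
print morsel.key, morsel.value  # Name and Value, the two required parts
print morsel['path']            # /
print morsel['domain']          # .zhihu.com
print morsel['expires']         # the expiration time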

2. Simulated login
The site crawled this time is Zhihu, which requires logging in before its content can be crawled. Form submission is easy to implement with Python's built-in libraries, as covered in earlier articles.
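For reference, a minimal sketch of such a form post with the built-in urllib/urllib2 (Python 2); the token and credentials below are placeholders:

import urllib
import urllib2

# hypothetical values; the real _xsrf token must be read from the page first
data = urllib.urlencode({
  '_xsrf': 'xxxx',
  'email': 'you@example.com',
  'password': '123456',
})
request = urllib2.Request('https://www.zhihu.com/login', data)  # POST, because data is given
response = urllib2.urlopen(request)
print response.getcode()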

Now let’s take a look at how to implement form submission through Scrapy.

First, inspect the form data sent on login, using the same technique as before: I deliberately entered a wrong password and captured the login request's headers and form data (using the Network panel in Chrome's built-in developer tools).

[Screenshot: the captured login form data in Chrome's Network panel]

Looking at the captured form data, you can see it has four fields:

  • email and password: the email address and password used to log in
  • rememberme: whether to remember the account
  • _xsrf: the first field, guessed to be a verification (anti-CSRF) mechanism

Only _xsrf is still unknown. Since this verification field must be delivered when the page itself is requested, check the source of the current page (right-click and choose View Page Source, or use the keyboard shortcut).

[Screenshot: the hidden _xsrf input found in the page source]

The page source confirms the guess: the _xsrf token is embedded in a hidden input field.

Now the form-login functions can be written:

  # Override the spider's start_requests method to issue a custom request;
  # when it succeeds, Scrapy calls the callback
  def start_requests(self):
    return [Request("https://www.zhihu.com/login", callback = self.post_login)]

  # FormRequest
  def post_login(self, response):
    print 'Preparing login'
    # Grab the _xsrf token from the returned page so that the form can be
    # submitted successfully
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print xsrf
    # FormRequest.from_response is a helper provided by Scrapy for posting forms;
    # after a successful login it calls the after_login callback
    return [FormRequest.from_response(response,
              formdata = {
                '_xsrf': xsrf,
                'email': '123456',
                'password': '123456'
              },
              callback = self.after_login
              )]

The key steps are explained in the code comments.
3. Saving Cookies
To keep crawling the site in the same logged-in state, cookies must be saved and reused to carry the session. Scrapy provides cookie-handling middleware that can be used directly:

CookiesMiddleware:

This middleware saves and tracks the cookies sent by the web server, and sends them back on subsequent requests.
The official Scrapy documentation gives the following code example:

for i, url in enumerate(urls):
  yield scrapy.Request(url, meta={'cookiejar': i},
    callback=self.parse_page)

def parse_page(self, response):
  # do some processing
  return scrapy.Request("http://www.example.com/otherpage",
    meta={'cookiejar': response.meta['cookiejar']},
    callback=self.parse_other_page)

We can now modify the methods in our spider class so that they track cookies:

  # Override the spider's start_requests method to issue a custom request;
  # when it succeeds, Scrapy calls the callback
  def start_requests(self):
    return [Request("https://www.zhihu.com/login", meta = {'cookiejar' : 1}, callback = self.post_login)] # meta added

  # Log in by posting the form with FormRequest
  def post_login(self, response):
    print 'Preparing login'
    # Grab the _xsrf token from the returned page so that the form can be
    # submitted successfully
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print xsrf
    # FormRequest.from_response is a helper provided by Scrapy for posting forms;
    # after a successful login it calls the after_login callback
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
              meta = {'cookiejar' : response.meta['cookiejar']}, # note how the cookiejar is picked up here
              headers = self.headers,
              formdata = {
                '_xsrf': xsrf,
                'email': '123456',
                'password': '123456'
              },
              callback = self.after_login,
              dont_filter = True
              )]

4. Disguising the headers
Sometimes logging into a website requires disguising the request headers, for example adding a Referer header to get past anti-hotlinking checks, or making the request look like it comes from a normal browser.

[Screenshot: the request headers captured in Chrome's Network panel]

To be safe, we can fill in more header fields, as follows:

  headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
    "Connection": "keep-alive",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
    "Referer": "http://www.zhihu.com/"
  }

In Scrapy, both Request and FormRequest accept a headers argument when they are constructed, so the custom headers can simply be passed in.
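For example (a brief sketch reusing the headers dict defined above):

from scrapy.http import Request, FormRequest

req = Request("http://www.zhihu.com/", headers = headers)
form_req = FormRequest("https://www.zhihu.com/login",
            formdata = {'email': '123456', 'password': '123456'},
            headers = headers)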

This gives the final version of the login spider:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from zhihu.items import ZhihuItem



class ZhihuSpider(CrawlSpider):
  name = "zhihu"
  allowed_domains = ["www.zhihu.com"]
  start_urls = [
    "http://www.zhihu.com"
  ]
  rules = (
    Rule(SgmlLinkExtractor(allow = ('/question/\d+#.*?', )), callback = 'parse_page', follow = True),
    Rule(SgmlLinkExtractor(allow = ('/question/\d+', )), callback = 'parse_page', follow = True),
  )
  headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
    "Connection": "keep-alive",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
    "Referer": "http://www.zhihu.com/"
  }

  # Override the spider's start_requests method to issue a custom request;
  # when it succeeds, Scrapy calls the callback
  def start_requests(self):
    return [Request("https://www.zhihu.com/login", meta = {'cookiejar' : 1}, callback = self.post_login)]

  # Log in by posting the form with FormRequest
  def post_login(self, response):
    print 'Preparing login'
    # Grab the _xsrf token from the returned page so that the form can be
    # submitted successfully
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print xsrf
    # FormRequest.from_response is a helper provided by Scrapy for posting forms;
    # after a successful login it calls the after_login callback
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
              meta = {'cookiejar' : response.meta['cookiejar']},
              headers = self.headers, # note the custom headers here
              formdata = {
                '_xsrf': xsrf,
                'email': '1095511864@qq.com',
                'password': '123456'
              },
              callback = self.after_login,
              dont_filter = True
              )]

  def after_login(self, response):
    for url in self.start_urls:
      yield self.make_requests_from_url(url)

  def parse_page(self, response):
    problem = Selector(response)
    item = ZhihuItem()
    item['url'] = response.url
    item['name'] = problem.xpath('//span[@class="name"]/text()').extract()
    print item['name']
    item['title'] = problem.xpath('//h2[@class="zm-item-title zm-editable-content"]/text()').extract()
    item['description'] = problem.xpath('//div[@class="zm-editable-content"]/text()').extract()
    item['answer'] = problem.xpath('//div[@class=" zm-editable-content clearfix"]/text()').extract()
    return item

5. Item class and crawl interval
Complete Zhihu crawler code link

from scrapy.item import Item, Field


class ZhihuItem(Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  url = Field()          # URL of the scraped question
  title = Field()        # title of the question
  description = Field()  # description of the question
  answer = Field()       # answers to the question
  name = Field()         # name of the user

Set the crawl interval: if the crawler requests pages too quickly, the site's anti-crawler mechanism will be triggered. Set it in settings.py:

BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'
DOWNLOAD_DELAY = 0.25  # set the download interval to 250 ms

For more settings, please view the official documentation
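For reference, a few other settings that are often tuned alongside DOWNLOAD_DELAY; these are standard Scrapy settings, but the values below are just examples:

COOKIES_ENABLED = True          # keep the login cookies between requests (the default)
RETRY_TIMES = 2                 # retry failed requests a couple of times
AUTOTHROTTLE_ENABLED = True     # let Scrapy adapt the delay to server load
AUTOTHROTTLE_START_DELAY = 1.0  # initial download delay for AutoThrottle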

Crawl results (only a small excerpt):

...
 'url': 'http://www.zhihu.com/question/20688855/answer/16577390'}
2014-12-19 23:24:15+0800 [zhihu] DEBUG: Crawled (200) <GET http://www.zhihu.com/question/20688855/answer/15861368> (referer: http://www.zhihu.com/question/20688855/answer/19231794)
[]
2014-12-19 23:24:15+0800 [zhihu] DEBUG: Scraped from <200 http://www.zhihu.com/question/20688855/answer/15861368>
  {'answer': [u'\u9009\u4f1a\u8ba1\u8fd9\u4e2a\u4e13\u4e1a\uff0c\u8003CPA\uff0c\u5165\u8d22\u52a1\u8fd9\u4e2a\u884c\u5f53\u3002\u8fd9\u4e00\u8def\u8d70\u4e0b\u6765\uff0c\u6211\u53ef\u4ee5\u5f88\u80af\u5b9a\u7684\u544a\u8bc9\u4f60\uff0c\u6211\u662f\u771f\u7684\u559c\u6b22\u8d22\u52a1\uff0c\u70ed\u7231\u8fd9\u4e2a\u884c\u4e1a\uff0c\u56e0\u6b64\u575a\u5b9a\u4e0d\u79fb\u5730\u5728\u8fd9\u4e2a\u884c\u4e1a\u4e2d\u8d70\u4e0b\u53bb\u3002',
        u'\u4e0d\u8fc7\u4f60\u8bf4\u6709\u4eba\u4ece\u5c0f\u5c31\u559c\u6b22\u8d22\u52a1\u5417\uff1f\u6211\u89c9\u5f97\u51e0\u4e4e\u6ca1\u6709\u5427\u3002\u8d22\u52a1\u7684\u9b45\u529b\u5728\u4e8e\u4f60\u771f\u6b63\u61c2\u5f97\u5b83\u4e4b\u540e\u3002',
        u'\u901a\u8fc7\u5b83\uff0c\u4f60\u53ef\u4ee5\u5b66\u4e60\u4efb\u4f55\u4e00\u79cd\u5546\u4e1a\u7684\u7ecf\u8425\u8fc7\u7a0b\uff0c\u4e86\u89e3\u5176\u7eb7\u7e41\u5916\u8868\u4e0b\u7684\u5b9e\u7269\u6d41\u3001\u73b0\u91d1\u6d41\uff0c\u751a\u81f3\u4f60\u53ef\u4ee5\u638c\u63e1\u5982\u4f55\u53bb\u7ecf\u8425\u8fd9\u79cd\u5546\u4e1a\u3002',
        u'\u5982\u679c\u5bf9\u4f1a\u8ba1\u7684\u8ba4\u8bc6\u4ec5\u4ec5\u505c\u7559\u5728\u505a\u5206\u5f55\u8fd9\u4e2a\u5c42\u9762\uff0c\u5f53\u7136\u4f1a\u89c9\u5f97\u67af\u71e5\u65e0\u5473\u3002\u5f53\u4f60\u5bf9\u5b83\u7684\u8ba4\u8bc6\u8fdb\u5165\u5230\u6df1\u5c42\u6b21\u7684\u65f6\u5019\uff0c\u4f60\u81ea\u7136\u5c31\u4f1a\u559c\u6b22\u4e0a\u5b83\u4e86\u3002\n\n\n'],
   'description': [u'\u672c\u4eba\u5b66\u4f1a\u8ba1\u6559\u80b2\u4e13\u4e1a\uff0c\u6df1\u611f\u5176\u67af\u71e5\u4e4f\u5473\u3002\n\u5f53\u521d\u662f\u51b2\u7740\u5e08\u8303\u4e13\u4e1a\u62a5\u7684\uff0c\u56e0\u4e3a\u68a6\u60f3\u662f\u6210\u4e3a\u4e00\u540d\u8001\u5e08\uff0c\u4f46\u662f\u611f\u89c9\u73b0\u5728\u666e\u901a\u521d\u9ad8\u4e2d\u8001\u5e08\u5df2\u7ecf\u8d8b\u4e8e\u9971\u548c\uff0c\u800c\u987a\u6bcd\u4eb2\u5927\u4eba\u7684\u610f\u9009\u4e86\u8fd9\u4e2a\u4e13\u4e1a\u3002\u6211\u559c\u6b22\u4e0a\u6559\u80b2\u5b66\u7684\u8bfe\uff0c\u5e76\u597d\u7814\u7a76\u5404\u79cd\u6559\u80b2\u5fc3\u7406\u5b66\u3002\u4f46\u4f1a\u8ba1\u8bfe\u4f3c\u4e4e\u662f\u4e3b\u6d41\u3001\u54ce\u3002\n\n\u4e00\u76f4\u4e0d\u559c\u6b22\u94b1\u4e0d\u94b1\u7684\u4e13\u4e1a\uff0c\u6240\u4ee5\u5f88\u597d\u5947\u5927\u5bb6\u9009\u4f1a\u8ba1\u4e13\u4e1a\u5230\u5e95\u662f\u51fa\u4e8e\u4ec0\u4e48\u76ee\u7684\u3002\n\n\u6bd4\u5982\u8bf4\u5b66\u4e2d\u6587\u7684\u4f1a\u8bf4\u4ece\u5c0f\u559c\u6b22\u770b\u4e66\uff0c\u4f1a\u6709\u4ece\u5c0f\u559c\u6b22\u4f1a\u8ba1\u501f\u554a\u8d37\u554a\u7684\u7684\u4eba\u5417\uff1f'],
   'name': [],
   'title': [u'\n\n', u'\n\n'],
   'url': 'http://www.zhihu.com/question/20688855/answer/15861368'}
...

6. Problems

  • The Rule design cannot achieve full-site crawling; it only covers simple question pages
  • The XPath selectors are not rigorous and need to be rethought
  • The Unicode escapes in the output should be converted to UTF-8 (one approach is sketched below)
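
For the third point, one common approach (a sketch, not the original author's code; the output file name is arbitrary) is an item pipeline that serializes items as UTF-8 JSON instead of \uXXXX escapes:

import json
import codecs

class JsonWithEncodingPipeline(object):
  def __init__(self):
    self.file = codecs.open('zhihu.json', 'w', encoding='utf-8')

  def process_item(self, item, spider):
    # ensure_ascii=False keeps the Chinese text readable instead of \uXXXX escapes
    line = json.dumps(dict(item), ensure_ascii=False) + '\n'
    self.file.write(line)
    return item

  def close_spider(self, spider):
    self.file.close()

Remember to register the pipeline under ITEM_PIPELINES in settings.py.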
