Node crawler advanced - login

In the previous article, Node entry scenario - crawler, we introduced the simplest node crawler implementation. This article goes one step further and discusses how to get past the login and crawl the data in the login area.

Contents

  • 1. Theoretical basis
    • How to maintain login status
    • How does the browser do it
  • 2. Node implementation
    • Access the login interface to obtain cookies
    • Request the interface in the login area
  • 3. How to break the verification code if there is one
  • 4. Extension
  • 5. Summary

1. Theoretical basis

How to maintain login status

HTTP is a stateless protocol: the client and server do not maintain a long connection. Between independent requests and responses, how can the server identify which requests come from the same client? You can easily think of the following mechanism:

[Figure: sessionId mechanism]

The core of this mechanism is Session id (sessionId):

  1. When the client requests the server, the server determines that the client has not passed in a sessionId. Fine, this guy is new: the server generates a sessionId for it, stores it in memory, and returns the sessionId to the client.
  2. The client gets the sessionId from the server and saves it locally, bringing this sessionId with the next request; the server checks whether this sessionId exists in its memory.
  3. If, in a previous step, the user accessed the login interface, then the sessionId already exists in memory as a key at this moment, with the user data saved as the value, so the server can return the data corresponding to this client based on the sessionId as a unique identifier.
  4. If either the client or the server loses the sessionId, the previous steps are repeated: no one knows anyone anymore, and everything starts over.

First the client establishes an association with the server through the sessionId, and then the user establishes an association with the server through the client (a key-value pair between the sessionId and the user data), thus maintaining the login state.
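
To make the mechanism concrete, here is a minimal server-side sketch using only node's built-in http and crypto modules. The in-memory `sessions` object and the cookie name `sessionId` are illustrative assumptions, not any particular site's implementation:

    var http = require('http');
    var crypto = require('crypto');

    var sessions = {}; // sessionId -> user data, held in server memory

    http.createServer(function (req, res) {
        // Look for a sessionId in the Cookie request header
        var match = /sessionId=([^;]+)/.exec(req.headers.cookie || '');
        var sid = match && match[1];

        if (!sid || !sessions[sid]) {
            // New client (or lost sessionId): generate one, store it,
            // and hand it back via Set-Cookie
            sid = crypto.randomBytes(16).toString('hex');
            sessions[sid] = {};
            res.setHeader('Set-Cookie', 'sessionId=' + sid);
        }

        res.end('your sessionId: ' + sid);
    }).listen(3000);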

How does the browser do it

So does the browser actually follow the mechanism designed above? It really does!

[Figure: browser sessionId flow]
What does the browser do:

1. In every http request the browser makes, it adds the cookies corresponding to the domain of the request address to the http request header (unless the user has disabled cookies). In the figure above, the first request to the server also carries a cookie in the request header, but there is no sessionId in that cookie yet.

2. The browser sets cookies according to the Set-Cookie field in the server's response header; for this reason, the server puts the generated sessionId into Set-Cookie. When the browser receives the Set-Cookie instruction, it sets a local cookie keyed by the domain of the request address. Generally, when the server returns Set-Cookie, the expiration time of the sessionId defaults to "when the browser closes", so it expires once the browser is closed. This is why a session lasts from opening to closing the browser (some websites can also offer "stay logged in" by setting cookies that do not expire for a long time).

3. When the browser initiates a request again, the cookie in the request header already contains the sessionId. If the user has visited the login interface before, the user data can be queried based on the sessionId.

Don't just take my word for it, here is an example:

1) First open the login page with Chrome; in the Application panel, find all the cookies under http://www.jianshu.com, then switch to the Network panel and check "Preserve log" (otherwise you will not be able to see the earlier log entries after the page redirects).


[Figure: login page cookies]
2) Then refresh the page and find the sign-in request. There are many Set-Cookie entries in its response header. Is the session-id among them?
[Figure: Set-Cookie entries in the sign-in response]

3) When you check the cookies again, the session-id has been saved. The next time you request other interfaces (such as obtaining the verification code or logging in), this session-id will be carried along; after logging in, the user's information is also associated with the session-id.

[Figure: session-id saved in the cookies]

2. Node implementation

We need to simulate the browser's way of working in order to crawl data in a website's login area.

I found a test website without a verification code. If there is a verification code, verification-code recognition is involved (login with overly complex verification codes is not considered here); the next section explains this.

Access the login interface to obtain cookies

    var superagent = require('superagent'); // http request library used below
    // `url` is assumed to be your own config object holding login_url / target_url

    // Part of a browser request's header information
    var browserMsg={
        "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
        'Content-Type':'application/x-www-form-urlencoded'
    };

    // Access the login interface to get the cookie
    function getLoginCookie(userid, pwd) {
        userid = userid.toUpperCase();
        return new Promise(function(resolve, reject) {
            superagent.post(url.login_url).set(browserMsg).send({
                userid: userid,
                pwd: pwd,
                timezoneOffset: '0'
            }).redirects(0).end(function (err, response) {
                // Get the cookie from the response header
                var cookie = response.headers["set-cookie"];
                resolve(cookie);
            });
        });
    }
  1. You need to capture a request under Chrome and copy some of its request header information, because the server may verify these request headers. For example, on the website I experimented with, I did not pass in the User-Agent at first; the server detected that the request did not come from a browser and returned a string of error messages, so I later set the User-Agent and disguised myself as a Chrome browser~~

  2. superagent is a client-side HTTP request library. You can use it to send requests and handle cookies conveniently (calling http.request yourself makes operating the header fields less convenient; after obtaining set-cookie, you have to assemble it into a properly formatted cookie yourself, as in the sketch below). redirects(0) is set mainly to prevent following redirects.
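
For comparison, if you did use the bare http module, you would have to flatten the set-cookie array into a Cookie header yourself. A rough sketch of that assembly (superagent does the equivalent for you):

    // response.headers['set-cookie'] is an array of strings like
    // 'sessionId=abc123; Path=/; HttpOnly'
    function toCookieHeader(setCookie) {
        return setCookie.map(function (item) {
            return item.split(';')[0]; // keep only the name=value pair
        }).join('; ');                 // e.g. 'sessionId=abc123; other=1'
    }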

Request the interface in the login area

    var cheerio = require('cheerio'); // jquery-like html parser used below

    function getData(cookie) {
        return new Promise(function(resolve, reject) {
            // Pass in the cookie obtained from the login interface
            superagent.get(url.target_url).set("Cookie",cookie).set(browserMsg).end(function(err,res) {
                var $ = cheerio.load(res.text);
                resolve({
                    cookie: cookie,
                    doc: $
                });
            });
        });
    }

After getting the set-cookie in the previous step, pass it into the getData method and set it into the request through superagent (the set-cookie will be formatted into a cookie); then you can get the data behind the login normally.
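
Chaining the two functions together might look like this (the credentials are hypothetical; url.login_url / url.target_url come from your own config as above):

    getLoginCookie('myUserId', 'myPassword')
        .then(getData)
        .then(function (result) {
            // result.doc is the cheerio-loaded page, result.cookie the session cookie
            console.log(result.doc('title').text());
        });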

In real scenarios it may not go so smoothly, because different websites have different security measures. For example: some websites may require you to request a token first, some require the parameters to be encrypted, and some with higher security even have anti-replay mechanisms. For a targeted crawler, this requires a detailed analysis of the website's processing mechanism; if it cannot be circumvented, then let it go~~ But this is still enough to deal with general content and information websites.

What is requested through the above method is only an html string. Here the old method applies: use the cheerio library to load the string, and you get an object similar to a jQuery DOM, which you can then operate on just like jQuery. It is truly an artifact, made with conscience!
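
For instance, extracting links from the logged-in page could look like this (the selector '.title a' is an illustrative assumption; it depends entirely on the target page):

    var $ = result.doc;
    $('.title a').each(function () {
        // print each link's text and href
        console.log($(this).text(), $(this).attr('href'));
    });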

3. How to break the verification code if there is one?

How many websites can you log in to without entering a verification code? Of course, we won't try to recognize 12306's verification code; we don't expect to crack such a conscientious one. But verification codes that are too young, too simple, like Zhihu's, can still be challenged.

[Figure: Zhihu login verification code]

Tesseract is Google's open-source OCR tool. Although it has nothing to do with node itself, it can be invoked and used together with node. For the specific usage, see: Using node.js to implement simple verification-code recognition.

However, even with graphicsmagick to preprocess the images, a high recognition rate cannot be guaranteed, so you may still need to train tesseract. Refer to: Using the jTessBoxEditor tool to train Tesseract 3.02.02 samples to improve the verification code recognition rate.
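
One way to call tesseract from node is to shell out to its CLI. A minimal sketch, assuming tesseract is installed and on the PATH (file names here are illustrative):

    var exec = require('child_process').exec;
    var fs = require('fs');

    // Recognize the text in a captcha image via the tesseract CLI
    function recognizeCaptcha(imagePath) {
        return new Promise(function (resolve, reject) {
            // 'tesseract input.png result' writes its text to result.txt
            exec('tesseract ' + imagePath + ' result', function (err) {
                if (err) return reject(err);
                fs.readFile('result.txt', 'utf8', function (e, text) {
                    e ? reject(e) : resolve(text.trim());
                });
            });
        });
    }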

Whether you can achieve a high recognition rate depends on your luck~~~

4. Extension

There is a simpler way to get around the login-state problem: use PhantomJS. PhantomJS is an open-source, WebKit-based server-side js API; it can be regarded as a browser, except that you control it through js scripts.

Since it completely simulates browser behavior, you don't need to care about set-cookie or cookies at all; you only need to simulate the user's click operations (of course, if there is a verification code, you still have to recognize it).
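
A rough sketch of such a login script follows. Note it is run with `phantomjs login.js`, not node, and the selectors and credentials are illustrative assumptions about the target page:

    var page = require('webpage').create();

    page.open('http://www.example.com/login', function (status) {
        // Fill in the form and click login inside the page context
        page.evaluate(function () {
            document.querySelector('#userid').value = 'myUserId';
            document.querySelector('#pwd').value = 'myPassword';
            document.querySelector('#login-btn').click();
        });

        // Wait for the redirect to finish, then dump the logged-in page
        setTimeout(function () {
            console.log(page.content);
            phantom.exit();
        }, 3000);
    });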

This method is not without shortcomings. Completely simulating browser behavior means it does not skip any request: it has to load js, css, and image resources you may not need, and you may have to click through multiple pages to reach the destination page, so it is less efficient than requesting the target url directly.

If you are interested, search for PhantomJS.

5. Summary

Although this article is about login for a node crawler, it spends a lot of time on principles beforehand. The purpose is that if you want to implement this in another language, you can do it with ease. Still the same sentence: understanding the principle is what matters.

Welcome to leave a message for discussion. If it is helpful to you, please leave a like~~
