This article introduces what a puppeteer crawler is and how crawlers work. I hope you find it a useful reference.
What is a crawler?
A crawler is also called a web robot. You probably use a search engine every day, and crawlers are an important part of search engines: they fetch content for indexing. Big data and data analysis are very popular nowadays, but where does the data come from? Much of it is collected by web crawlers. So let's discuss how web crawlers work.
How crawlers work
As shown in the figure, this is the crawler's flow chart. A crawl starts from a seed URL: the page is downloaded, its content is parsed and stored, and the URLs found in the page are de-duplicated and added to a queue of URLs waiting to be crawled. The next URL is then taken from the queue and the steps repeat. Simple, isn't it?
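To make that loop concrete, here is a minimal sketch; fetchPage(), extractUrls() and store() are hypothetical placeholders standing in for the download, parse and store steps:

// Minimal sketch of the crawl loop described above.
// fetchPage(), extractUrls() and store() are hypothetical helpers.
async function crawl(seedUrl) {
  const queue = [seedUrl];         // URLs waiting to be crawled
  const seen = new Set([seedUrl]); // de-duplication set
  while (queue.length > 0) {
    const url = queue.shift();
    const html = await fetchPage(url);  // download the page
    store(url, html);                   // persist the content
    for (const link of extractUrls(html)) {
      if (!seen.has(link)) {            // skip URLs we already know
        seen.add(link);
        queue.push(link);
      }
    }
  }
}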
Breadth-first (BFS) or depth-first (DFS) strategy
As mentioned above, after crawling a page the crawler selects the next URL from the waiting queue. But how should it choose? Should it pick a URL found in the page it just crawled, or keep picking URLs at the same level as the current one? A same-level URL here means a URL that came from the same page. This choice is exactly the difference between the crawling strategies.
Breadth First Strategy (BFS)
The breadth-first strategy first crawls all the URLs of the current page completely, and only then crawls the URLs extracted from those pages. If the relationship diagram above represents the links between pages, the BFS crawl order would be: (A->(B,D,F,G)->(C,F)).
Depth First Strategy (DFS)
The depth-first strategy crawls a page and then immediately continues with a URL parsed from that page, going deeper until the branch is exhausted, for example:
(A->B->C->D->E->F->G)
A sketch contrasting the two orders follows below.
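To make the difference concrete, here is a small sketch with a made-up link graph (the original diagram is not reproduced here, so the graph only loosely mirrors the BFS example): the two strategies differ only in whether the next URL is taken from the front or the back of the pending list.

// Sketch: BFS and DFS differ only in which pending URL is taken next.
// The graph is hypothetical, loosely based on the BFS example above.
const graph = {
  A: ['B', 'D', 'F', 'G'],
  B: ['C', 'F'],
  C: [], D: [], F: [], G: [],
};

function traverse(start, depthFirst) {
  const pending = [start];
  const seen = new Set([start]);
  const order = [];
  while (pending.length > 0) {
    // DFS takes the most recently found URL, BFS the oldest one.
    const url = depthFirst ? pending.pop() : pending.shift();
    order.push(url);
    for (const next of graph[url] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        pending.push(next);
      }
    }
  }
  return order;
}

console.log(traverse('A', false)); // BFS: A, B, D, F, G, C
console.log(traverse('A', true));  // DFS: A, G, F, D, B, C (one valid DFS order)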
Downloading the page
Downloading a web page seems very simple: just like entering a link in the browser, which displays the page once the download finishes. Of course, it is not that simple.
Simulated login
Some pages require you to log in before their content is visible. How does a crawler log in? The login process essentially obtains access credentials (a cookie, a token, and so on):

const request = require('request');

let cookie = '';
let j = request.jar();

async function login() {
  if (cookie) {
    return await Promise.resolve(cookie);
  }
  return await new Promise((resolve, reject) => {
    request.post({
      url: 'url', // placeholder for the real login endpoint
      form: {
        m: 'username',
        p: 'password',
      },
      jar: j
    }, function (err, res, body) {
      if (err) {
        reject(err);
        return;
      }
      cookie = j.getCookieString('url'); // same placeholder URL
      resolve(cookie);
    });
  });
}

Here is a simple example: log in once to obtain the cookie, then send that cookie along with every subsequent request.
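Once the cookie has been cached, later requests just need to carry it. A minimal sketch reusing the jar j from above (dataUrl stands in for a real endpoint):

// Sketch: reuse the cached session cookie (via the same jar) on later requests.
async function fetchWithLogin(dataUrl) {
  await login(); // ensures the jar holds a valid session cookie
  return new Promise((resolve, reject) => {
    request.get({ url: dataUrl, jar: j }, (err, res, body) => {
      if (err) return reject(err);
      resolve(body);
    });
  });
}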
Get web content
Some web content is rendered on the server side: there is no CGI endpoint to fetch the data from, so the content can only be parsed out of the HTML. But not every site is that simple. Sites like LinkedIn do not let you obtain the page content with a plain request; the page has to be executed in a browser to produce the final HTML structure. So how do we solve that? Browser execution was just mentioned, but is there a programmable browser? Yes: puppeteer, the open-source headless-browser project from the Google Chrome team. With a headless browser we can simulate a user's visit, obtain the page content, and crawl it.
Use puppeteer to simulate login
const puppeteer = require('puppeteer');

async function login(username, password) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 1400, height: 1000 });
  await page.goto('https://example.cn/login');
  console.log(page.url());
  await page.type('input[type=text]', username, { delay: 100 });
  await page.type('input[type=password]', password, { delay: 100 });
  // Click submit and wait for the post-login navigation together,
  // so the navigation event is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.$eval('input[type=submit]', el => el.click()),
  ]);
  return page;
}

After executing login(), you can read the HTML content just as if you had logged in through the browser. Of course, we can also request the CGI endpoints directly:
// page comes from login() above; cinfo is company info gathered
// earlier in the crawl (elided in the original article).
async function crawlData(index, data) {
  let dataUrl = `https://example.cn/company/contacts?count=20&page=${index}&query=&dist=0&cid=${cinfo.cid}&company=${cinfo.encodename}&forcomp=1&searchTokens=&highlight=false&school=&me=&webcname=&webcid=&jsononly=1`;
  await page.goto(dataUrl);
  // ...
}

On some sites the cookie is the same on every crawl, so you can simply crawl with the headless browser and stop worrying about cookies on each request.
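If the endpoint returns JSON (as the jsononly=1 parameter suggests), one way to read it after page.goto() is to parse the rendered body text. This is a sketch of one possible approach, not code from the original article:

// Sketch: after page.goto() on a JSON endpoint, the body text is the payload.
async function fetchJson(page, dataUrl) {
  await page.goto(dataUrl);
  const text = await page.evaluate(() => document.body.innerText);
  return JSON.parse(text);
}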
Final thoughts
Of course, there is more to crawling than this: you also need to analyze the target website and find a suitable crawling strategy. As for puppeteer, it is not limited to crawling; since it is a programmable headless browser, it can also be used for automated testing and much more.