This article introduces what a puppeteer crawler is and how crawlers work. It has some reference value; friends in need can refer to it, and I hope it helps you.
What is a crawler?
A crawler is also called a web robot. You probably use a search engine every day, and crawlers are an important part of search engines: they crawl content for indexing. Big data and data analysis are very popular nowadays, so where does the data come from? It can be collected by web crawlers. Let me discuss web crawlers below.
The working principle of the crawler
As shown in the flow chart, the crawl starts from a seed URL. The page is downloaded, its content is parsed and stored, and at the same time the URLs found in the parsed page are de-duplicated and added to the queue of URLs waiting to be crawled. Then the next URL is taken from the queue and the steps repeat. Isn't it simple?
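Below is a minimal sketch of this loop in JavaScript. It only illustrates the flow described above; downloadPage(), store(), and extractUrls() are hypothetical helpers standing in for the download, storage, and link-parsing steps.

```javascript
// Minimal sketch of the crawl loop: seed URL -> download -> parse/store
// -> de-duplicate extracted URLs -> enqueue -> repeat.
// downloadPage(), store() and extractUrls() are hypothetical helpers.
async function crawl(seedUrl) {
  const queue = [seedUrl];     // URLs waiting to be crawled
  const seen = new Set(queue); // de-duplication of URLs

  while (queue.length > 0) {
    const url = queue.shift();            // next URL waiting to be crawled
    const html = await downloadPage(url); // download the web page
    store(url, html);                     // parse and store the content

    for (const link of extractUrls(html)) {
      if (!seen.has(link)) { // skip URLs already queued or crawled
        seen.add(link);
        queue.push(link);
      }
    }
  }
}
```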
Breadth-first (BFS) or depth-first (DFS) strategy
As mentioned above, after crawling a web page we pick the next URL from the waiting queue. But how do we choose? Should we pick a URL found in the page we just crawled, or keep going through URLs at the same level as the current one? Same-level URLs here means URLs that come from the same web page; this choice is exactly the difference between the crawling strategies.
Breadth First Strategy (BFS)
The breadth-first strategy crawls all the URLs of the current web page completely first, and only then crawls the URLs extracted from those pages. If the relationship diagram above represents the links between web pages, the BFS crawl order is: (A->(B,D,F,G)->(C,F)).
Depth First Strategy (DFS)
The depth-first strategy crawls a web page and then immediately continues with a URL parsed out of that page, following each branch all the way down until the crawl is completed:
(A->B->C->D->E->F->G)
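The only difference between the two strategies is which URL is taken from the frontier next: the oldest one (a queue, BFS) or the newest one (a stack, DFS). Here is a small sketch contrasting the two; the link graph is an assumption standing in for the figure above.

```javascript
// A toy link graph (an assumption: A links to B, D, F, G; B links to C, F).
const graph = {
  A: ['B', 'D', 'F', 'G'],
  B: ['C', 'F'],
  C: [], D: [], F: [], G: [],
};

function traverse(start, breadthFirst) {
  const frontier = [start];
  const seen = new Set(frontier);
  const order = [];
  while (frontier.length > 0) {
    // BFS takes from the front (queue); DFS takes from the back (stack).
    const node = breadthFirst ? frontier.shift() : frontier.pop();
    order.push(node);
    for (const next of graph[node]) {
      if (!seen.has(next)) {
        seen.add(next);
        frontier.push(next);
      }
    }
  }
  return order;
}

console.log(traverse('A', true));  // BFS: A, B, D, F, G, C
console.log(traverse('A', false)); // DFS: A, G, F, D, B, C (a stack visits siblings in reverse)
```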
Download the web page
Downloading a web page seems simple: it is just like entering a link into a browser, which shows the page once the download finishes. Of course, in practice it is not that simple.

Simulated login
For some web pages, you must log in before the content becomes visible. How does a crawler log in? In essence, logging in just means obtaining the access credentials (a cookie, a token, ...). Here is a simple example: log in once to get the cookie, then bring the cookie along with each subsequent request.

```javascript
const request = require('request');

let cookie = '';
let j = request.jar();

async function login() {
  if (cookie) {
    return cookie;
  }
  return new Promise((resolve, reject) => {
    request.post({
      url: 'url', // the login endpoint of the target site
      form: {
        m: 'username',
        p: 'password',
      },
      jar: j // store the session cookie in this jar
    }, function (err, res, body) {
      if (err) {
        reject(err);
        return;
      }
      cookie = j.getCookieString('url');
      resolve(cookie);
    });
  });
}
```
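To actually "bring the cookie with each request", the same jar can be reused on every later call. A hedged sketch using the same request library (fetchPage and its pageUrl parameter are illustrative, not from the original article):

```javascript
// Sketch: reuse the cookie jar from login() on every subsequent request.
async function fetchPage(pageUrl) {
  await login(); // make sure the cookie has been obtained first
  return new Promise((resolve, reject) => {
    request.get({ url: pageUrl, jar: j }, (err, res, body) => {
      if (err) {
        reject(err);
        return;
      }
      resolve(body); // the HTML of the logged-in page
    });
  });
}
```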
Get web content
Some web content is rendered on the server side: there is no CGI that returns the data, and the content can only be parsed out of the HTML. But not every website is that simple. Sites like LinkedIn do not hand over their page content directly; the page has to be executed in a browser to produce the final HTML structure. So how do we solve that? I mentioned browser execution above, so is there a programmable browser? There is: Puppeteer, the open source headless browser project from the Google Chrome team. With a headless browser we can simulate a user visiting the site, obtain the final rendered content of a page, and crawl it.

Use puppeteer to simulate login
```javascript
const puppeteer = require('puppeteer');

let page; // shared with crawlData() below

async function login(username, password) {
  const browser = await puppeteer.launch();
  page = await browser.newPage();
  await page.setViewport({ width: 1400, height: 1000 });
  await page.goto('https://example.cn/login');
  console.log(page.url());
  await page.type('input[type=text]', username, { delay: 100 });
  await page.type('input[type=password]', password, { delay: 100 });
  await Promise.all([
    page.waitForNavigation(), // wait for the post-login navigation
    page.$eval('input[type=submit]', el => el.click()),
  ]);
  return page;
}
```
After executing login(), you can read the page content just as if you had logged in in the browser. Of course, you can also request the CGI directly:
```javascript
async function crawlData(index, data) {
  let dataUrl = `https://example.cn/company/contacts?count=20&page=${index}&query=&dist=0&cid=${cinfo.cid}&company=${cinfo.encodename}&forcomp=1&searchTokens=&highlight=false&school=&me=&webcname=&webcid=&jsononly=1`;
  await page.goto(dataUrl);
  // ...
}
```

For some websites you can also simply do the whole crawl through the headless browser; the browser then manages the cookies, so you no longer have to worry about them on every request.
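To read back what the headless browser actually rendered, puppeteer exposes page.content() for the final HTML and page.evaluate() for anything computed inside the page. The following is only a sketch: crawlCompany is a hypothetical wrapper, and it assumes the CGI above responds with JSON (the jsononly=1 parameter suggests this, but the article does not confirm it).

```javascript
// Sketch: pull the rendered result out of the shared headless-browser page.
// login() and crawlData() are the functions defined above.
async function crawlCompany(username, password) {
  await login(username, password);
  await crawlData(1);                // navigate the shared page to the data URL
  const html = await page.content(); // final HTML after scripts have run
  // Assumption: the CGI returns JSON, so the body text can be parsed directly.
  const data = JSON.parse(await page.evaluate(() => document.body.innerText));
  return { html, data };
}
```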
Final words
Of course, there is more to crawlers than this: you also need to analyze the target website and find a crawling strategy that suits it. As for puppeteer, it is not only useful for crawlers; since it is a programmable headless browser, it can also be used for automated testing and more.