search
HomeWeb Front-endJS TutorialImplementation code for puppeteer simulated login and capture page

This time I will bring you the implementation code of puppeteer's simulated login and capture page. What are the precautions for the implementation of puppeteer's simulated login and capture page. The following is a practical case, let's take a look.

About heat map

In the website analysis industry, website heat map can well reflect the user's operating behavior on the website. Specific analysis User preferences, targeted optimization of the website, an example of a heat map (from ptengine)

You can clearly see in the above picture where the user’s focus is Well, we do not pay attention to the function of the heat map in the product. This article will make a simple analysis and summary of the implementation of the heat map.

Mainstream implementation methods of heat maps

Generally, the following stages are required to implement heat map display: 1. Obtain the website Page
2. Obtain processed user data
3. Draw heat map
This article mainly focuses on stage 1 to introduce in detail the mainstream implementation method of obtaining website pages in heat map
4. Use iframe to directly embed the user website
5. Grab the user page and save it locally, and embed local resources through iframe (the so-called local resources are considered to be the analysis tool side)

There are two ways. Advantages and Disadvantages of Each

First of all, the first one is to embed directly into the user website. This has certain restrictions. For example, if the user website does not allow iframe nesting in order to prevent iframe hijacking (set meta X-FRAME-OPTIONS to sameorgin Or set the http header directly, or even control it directly through js if(window.top !== window.self){ window.top.location = window.location;} ), in this case you need The client's website has to do some work before it can be loaded by the iframe of the analysis tool, and it may not be so convenient to use because not all website users who need to detect and analyze the website can manage the website.

The second method is to directly capture the website page to the local server, and then browse the page captured on the local server. In this case, the page has already come over, and we can do whatever we want. First, we Bypassing the problem that X-FRAME-OPTIONS is sameorgin, we only need to solve the problem of js control. For the captured pages, we can handle it through special correspondence (such as removing the corresponding js control, or adding our own js); however, this method also has many shortcomings: 1. It cannot crawl the spa page, cannot crawl the page that requires user login authorization, cannot crawl the page where the user has set a clear setting, etc.

Both methods have https and http resources. Another problem caused by the same-origin policy is that the https station cannot load http resources, so for the best compatibility, the heat map analysis tool needs to be applied with the http protocol. , of course, specific sub-station optimization can be carried out according to the customer websites visited.

How to optimize the crawling of website pages

Here we do some optimization based on puppeteer to improve the crawling of the problems encountered in crawling website pages The probability of success mainly optimizes the following two pages:

1.spa page

The spa page is considered mainstream on the current page, but it is always well-known for its Unfriendly to search engines; the usual page crawler is actually a simple crawler, and the process usually involves initiating an http get request to the user's website (should be the user's website server). This crawling method itself has problems. First of all, the direct request is to the user server. The user server should have many restrictions on non-browser agents and need to be bypassed. Secondly, the request returns the original content, which needs to be bypassed. The part rendered through js in the browser cannot be obtained (of course, after the iframe is embedded, js execution will still make up for this problem to a certain extent). Finally, if the page is a spa page, then only the template is obtained at this time. In the heat map The display effect is very unfriendly. In response to this situation, if it is done based on puppeteer, the process becomes

puppeteer starts the browser to open the user website-->page rendering-->returns the rendered result. Simply use pseudo code to implement it as follows:

const puppeteer = require('puppeteer');
async getHtml = (url) =>{
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  return await page.content();
}

In this way, the content we get is the rendered content, regardless of the rendering method of the page (client-side rendering or server-side rendering)

Pages that require login

There are actually many situations for pages that require login:

需要登录才可以查看页面,如果没有登录,则跳转到login页面(各种管理系统)

对于这种类型的页面我们需要做的就是模拟登录,所谓模拟登录就是让浏览器去登录,这里需要用户提供对应网站的用户名和密码,然后我们走如下的流程:

访问用户网站-->用户网站检测到未登录跳转到login-->puppeteer控制浏览器自动登录后跳转到真正需要抓取的页面,可用如下伪代码来说明:

const puppeteer = require("puppeteer");
async autoLogin =(url)=>{
   const browser = await puppeteer.launch();
   const page =await browser.newPage();
   await page.goto(url);
   await page.waitForNavigation();
   //登录
   await page.type('#username',"用户提供的用户名");
   await page.type('#password','用户提供的密码');
   await page.click('#btn_login');
  //页面登录成功后,需要保证redirect 跳转到请求的页面
   await page.waitForNavigation();
   return await page.content();
}

登录与否都可以查看页面,只是登录后看到内容会所有不同 (各种电商或者portal页面)

这种情况处理会比较简单一些,可以简单的认为是如下步骤:

通过puppeteer启动浏览器打开请求页面-->点击登录按钮-->输入用户名和密码登录 -->重新加载页面

基本代码如下图:

const puppeteer = require("puppeteer");
async autoLoginV2 =(url)=>{
   const browser = await puppeteer.launch();
   const page =await browser.newPage();
   await page.goto(url);
   await page.click('#btn_show_login');
   //登录
   await page.type('#username',"用户提供的用户名");
   await page.type('#password','用户提供的密码');
   await page.click('#btn_login');
  //页面登录成功后,是否需要reload 根据实际情况来确定
   await page.reload();
   return await page.content();
}

总结

明天总结吧,今天下班了。

补充(还昨天的债):基于puppeteer虽然可以很友好的抓取页面内容,但是也存在这很多的局限

1.抓取的内容为渲染后的原始html,即资源路径(css、image、javascript)等都是相对路径,保存到本地后无法正常显示,需要特殊处理(js不需要特殊处理,甚至可以移除,因为渲染的结构已经完成)

2.通过puppeteer抓取页面性能会比直接http get 性能会差一些,因为多了渲染的过程

3.同样无法保证页面的完整性,只是很大的提高了完整的概率,虽然通过page对象提供的各种wait 方法能够解决这个问题,但是网站不同,处理方式就会不同,无法复用。

相信看了本文案例你已经掌握了方法,更多精彩请关注php中文网其它相关文章!

推荐阅读:

Vue如何操作html字段字符串转换为HTML标签

Koa2实现文件上传下载案例分析

The above is the detailed content of Implementation code for puppeteer simulated login and capture page. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Javascript Data Types : Is there any difference between Browser and NodeJs?Javascript Data Types : Is there any difference between Browser and NodeJs?May 14, 2025 am 12:15 AM

JavaScript core data types are consistent in browsers and Node.js, but are handled differently from the extra types. 1) The global object is window in the browser and global in Node.js. 2) Node.js' unique Buffer object, used to process binary data. 3) There are also differences in performance and time processing, and the code needs to be adjusted according to the environment.

JavaScript Comments: A Guide to Using // and /* */JavaScript Comments: A Guide to Using // and /* */May 13, 2025 pm 03:49 PM

JavaScriptusestwotypesofcomments:single-line(//)andmulti-line(//).1)Use//forquicknotesorsingle-lineexplanations.2)Use//forlongerexplanationsorcommentingoutblocksofcode.Commentsshouldexplainthe'why',notthe'what',andbeplacedabovetherelevantcodeforclari

Python vs. JavaScript: A Comparative Analysis for DevelopersPython vs. JavaScript: A Comparative Analysis for DevelopersMay 09, 2025 am 12:22 AM

The main difference between Python and JavaScript is the type system and application scenarios. 1. Python uses dynamic types, suitable for scientific computing and data analysis. 2. JavaScript adopts weak types and is widely used in front-end and full-stack development. The two have their own advantages in asynchronous programming and performance optimization, and should be decided according to project requirements when choosing.

Python vs. JavaScript: Choosing the Right Tool for the JobPython vs. JavaScript: Choosing the Right Tool for the JobMay 08, 2025 am 12:10 AM

Whether to choose Python or JavaScript depends on the project type: 1) Choose Python for data science and automation tasks; 2) Choose JavaScript for front-end and full-stack development. Python is favored for its powerful library in data processing and automation, while JavaScript is indispensable for its advantages in web interaction and full-stack development.

Python and JavaScript: Understanding the Strengths of EachPython and JavaScript: Understanding the Strengths of EachMay 06, 2025 am 12:15 AM

Python and JavaScript each have their own advantages, and the choice depends on project needs and personal preferences. 1. Python is easy to learn, with concise syntax, suitable for data science and back-end development, but has a slow execution speed. 2. JavaScript is everywhere in front-end development and has strong asynchronous programming capabilities. Node.js makes it suitable for full-stack development, but the syntax may be complex and error-prone.

JavaScript's Core: Is It Built on C or C  ?JavaScript's Core: Is It Built on C or C ?May 05, 2025 am 12:07 AM

JavaScriptisnotbuiltonCorC ;it'saninterpretedlanguagethatrunsonenginesoftenwritteninC .1)JavaScriptwasdesignedasalightweight,interpretedlanguageforwebbrowsers.2)EnginesevolvedfromsimpleinterpreterstoJITcompilers,typicallyinC ,improvingperformance.

JavaScript Applications: From Front-End to Back-EndJavaScript Applications: From Front-End to Back-EndMay 04, 2025 am 12:12 AM

JavaScript can be used for front-end and back-end development. The front-end enhances the user experience through DOM operations, and the back-end handles server tasks through Node.js. 1. Front-end example: Change the content of the web page text. 2. Backend example: Create a Node.js server.

Python vs. JavaScript: Which Language Should You Learn?Python vs. JavaScript: Which Language Should You Learn?May 03, 2025 am 12:10 AM

Choosing Python or JavaScript should be based on career development, learning curve and ecosystem: 1) Career development: Python is suitable for data science and back-end development, while JavaScript is suitable for front-end and full-stack development. 2) Learning curve: Python syntax is concise and suitable for beginners; JavaScript syntax is flexible. 3) Ecosystem: Python has rich scientific computing libraries, and JavaScript has a powerful front-end framework.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

MinGW - Minimalist GNU for Windows

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

ZendStudio 13.5.1 Mac

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),