Web Scraping Made Easy: Parse Any HTML Page with Puppeteer-JS Tutorial-php.cn

Home

Web Front-end

JS Tutorial

Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Sep 05, 2024 pm 10:34 PM

Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

Imagine building an e-commerce platform where we can easily fetch product data in real-time from major stores like eBay, Amazon, and Flipkart. Sure, there’s Shopify and similar services, but let's be honest—it can feel a bit cumbersome to buy a subscription just for a project. So, I thought, why not scrape these sites and store the products directly in our database? It would be an efficient and cost-effective way to get products for our e-commerce projects.

What is Web Scraping?

Web scraping involves extracting data from websites by parsing the HTML of web pages to read and collect content. It often involves automating a browser or sending HTTP requests to the site, and then analyzing the HTML structure to retrieve specific pieces of information like text, links, or images.Puppeteer is one library used to scrape the websites.

?What is Puppeteer?

Puppeteer is a Node.js library.It provides a high-level API for controlling headless Chrome or Chromium browsers.Headless Chrome is a version of chrome that runs everything without an UI(perfect for running things in the background).

We can automate various tasks using puppeteer,such as:

Web Scraping: Extracting content from websites involves interacting with the page's HTML and JavaScript. We typically retrieve the content by targeting the CSS selectors.
PDF Generation: Converting web pages into PDFs programmatically is ideal when you want to directly generate a PDF from a web page, rather than taking a screenshot and then converting the screenshot to a PDF. (P.S. Apologies if you already have workarounds for this).
Automated Testing: Running tests on web pages by simulating user actions like clicking buttons, filling out forms, and taking screenshots. This eliminates the tedious process of manually going through long forms to ensure everything is in place.

?How to get started with puppetter?

Firstly we have to install the library,go ahead and do this.
Using npm:

npm i puppeteer # Downloads compatible Chrome during installation.
npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.

Using yarn:

yarn add puppeteer // Downloads compatible Chrome during installation.
yarn add puppeteer-core // Alternatively, install as a library, without downloading Chrome.

Using pnpm:

pnpm add puppeteer # Downloads compatible Chrome during installation.
pnpm add puppeteer-core # Alternatively, install as a library, without downloading Chrome.

? Example to demonstrate the use of puppeteer

Here is an example of how to scrape a website. (P.S. I used this code to retrieve products from the Myntra website for my e-commerce project.)

const puppeteer = require("puppeteer");
const CategorySchema = require("./models/Category");

// Define the scrape function as a named async function
const scrape = async () => {
    // Launch a new browser instance
    const browser = await puppeteer.launch({ headless: false });

    // Open a new page
    const page = await browser.newPage();

    // Navigate to the target URL and wait until the DOM is fully loaded
    await page.goto('https://www.myntra.com/mens-sport-wear?rawQuery=mens%20sport%20wear', { waitUntil: 'domcontentloaded' });

    // Wait for additional time to ensure all content is loaded
    await new Promise((resolve) => setTimeout(resolve, 25000));

    // Extract product details from the page
    const items = await page.evaluate(() => {
        // Select all product elements
        const elements = document.querySelectorAll('.product-base');
        const elementsArray = Array.from(elements);

        // Map each element to an object with the desired properties
        const results = elementsArray.map((element) => {
            const image = element.querySelector(".product-imageSliderContainer img")?.getAttribute("src");
            return {
                image: image ?? null,
                brand: element.querySelector(".product-brand")?.textContent,
                title: element.querySelector(".product-product")?.textContent,
                discountPrice: element.querySelector(".product-price .product-discountedPrice")?.textContent,
                actualPrice: element.querySelector(".product-price .product-strike")?.textContent,
                discountPercentage: element.querySelector(".product-price .product-discountPercentage")?.textContent?.split(' ')[0]?.slice(1, -1),
                total: 20, // Placeholder value, adjust as needed
                available: 10, // Placeholder value, adjust as needed
                ratings: Math.round((Math.random() * 5) * 10) / 10 // Random rating for demonstration
            };
        });

        return results; // Return the list of product details
    });

    // Close the browser
    await browser.close();

    // Prepare the data for saving
    const data = {
        category: "mens-sport-wear",
        subcategory: "Mens",
        list: items
    };

    // Create a new Category document and save it to the database
    // Since we want to store product information in our e-commerce store, we use a schema and save it to the database.
    // If you don't need to save the data, you can omit this step.
    const category = new CategorySchema(data);
    console.log(category);
    await category.save();

    // Return the scraped items
    return items;
};

// Export the scrape function as the default export
module.exports = scrape;

?Explanation:

In this code, we are using Puppeteer to scrape product data from a website. After extracting the details, we create a schema (CategorySchema) to structure and save this data into our database. This step is particularly useful if we want to integrate the scraped products into our e-commerce store. If storing the data in a database is not required, you can omit the schema-related code.
Before scraping, it's important to understand the HTML structure of the page and identify which CSS selectors contain the content you want to extract.
In my case, I used the relevant CSS selectors identified on the Myntra website to extract the content I was targeting.

The above is the detailed content of Web Scraping Made Easy: Parse Any HTML Page with Puppeteer. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

JavaScript Frameworks: Powering Modern Web DevelopmentMay 02, 2025 am 12:04 AM

The power of the JavaScript framework lies in simplifying development, improving user experience and application performance. When choosing a framework, consider: 1. Project size and complexity, 2. Team experience, 3. Ecosystem and community support.

The Relationship Between JavaScript, C , and BrowsersMay 01, 2025 am 12:06 AM

Introduction I know you may find it strange, what exactly does JavaScript, C and browser have to do? They seem to be unrelated, but in fact, they play a very important role in modern web development. Today we will discuss the close connection between these three. Through this article, you will learn how JavaScript runs in the browser, the role of C in the browser engine, and how they work together to drive rendering and interaction of web pages. We all know the relationship between JavaScript and browser. JavaScript is the core language of front-end development. It runs directly in the browser, making web pages vivid and interesting. Have you ever wondered why JavaScr

Node.js Streams with TypeScriptApr 30, 2025 am 08:22 AM

Node.js excels at efficient I/O, largely thanks to streams. Streams process data incrementally, avoiding memory overload—ideal for large files, network tasks, and real-time applications. Combining streams with TypeScript's type safety creates a powe

Python vs. JavaScript: Performance and Efficiency ConsiderationsApr 30, 2025 am 12:08 AM

The differences in performance and efficiency between Python and JavaScript are mainly reflected in: 1) As an interpreted language, Python runs slowly but has high development efficiency and is suitable for rapid prototype development; 2) JavaScript is limited to single thread in the browser, but multi-threading and asynchronous I/O can be used to improve performance in Node.js, and both have advantages in actual projects.

The Origins of JavaScript: Exploring Its Implementation LanguageApr 29, 2025 am 12:51 AM

JavaScript originated in 1995 and was created by Brandon Ike, and realized the language into C. 1.C language provides high performance and system-level programming capabilities for JavaScript. 2. JavaScript's memory management and performance optimization rely on C language. 3. The cross-platform feature of C language helps JavaScript run efficiently on different operating systems.

Behind the Scenes: What Language Powers JavaScript?Apr 28, 2025 am 12:01 AM

JavaScript runs in browsers and Node.js environments and relies on the JavaScript engine to parse and execute code. 1) Generate abstract syntax tree (AST) in the parsing stage; 2) convert AST into bytecode or machine code in the compilation stage; 3) execute the compiled code in the execution stage.

The Future of Python and JavaScript: Trends and PredictionsApr 27, 2025 am 12:21 AM

The future trends of Python and JavaScript include: 1. Python will consolidate its position in the fields of scientific computing and AI, 2. JavaScript will promote the development of web technology, 3. Cross-platform development will become a hot topic, and 4. Performance optimization will be the focus. Both will continue to expand application scenarios in their respective fields and make more breakthroughs in performance.

Python vs. JavaScript: Development Environments and ToolsApr 26, 2025 am 12:09 AM

Both Python and JavaScript's choices in development environments are important. 1) Python's development environment includes PyCharm, JupyterNotebook and Anaconda, which are suitable for data science and rapid prototyping. 2) The development environment of JavaScript includes Node.js, VSCode and Webpack, which are suitable for front-end and back-end development. Choosing the right tools according to project needs can improve development efficiency and project success rate.

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

What's New in Windows 11 KB5054979 & How to Fix Update Issues

4 weeks agoByDDD

How to fix KB5055523 fails to install in Windows 11?

3 weeks agoByDDD

InZoi: How To Apply To School And University

1 months agoByDDD

How to fix KB5055518 fails to install in Windows 10?

3 weeks agoByDDD

Where to find the Site Office Key in Atomfall

4 weeks agoByDDD

Hot Tools

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.