search
HomeWeb Front-endJS TutorialWeb Scraping Made Easy: Parse Any HTML Page with Puppeteer

Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

Imagine building an e-commerce platform where we can easily fetch product data in real-time from major stores like eBay, Amazon, and Flipkart. Sure, there’s Shopify and similar services, but let's be honest—it can feel a bit cumbersome to buy a subscription just for a project. So, I thought, why not scrape these sites and store the products directly in our database? It would be an efficient and cost-effective way to get products for our e-commerce projects.

What is Web Scraping?

Web scraping involves extracting data from websites by parsing the HTML of web pages to read and collect content. It often involves automating a browser or sending HTTP requests to the site, and then analyzing the HTML structure to retrieve specific pieces of information like text, links, or images.Puppeteer is one library used to scrape the websites.

?What is Puppeteer?

Puppeteer is a Node.js library.It provides a high-level API for controlling headless Chrome or Chromium browsers.Headless Chrome is a version of chrome that runs everything without an UI(perfect for running things in the background).

We can automate various tasks using puppeteer,such as:

  • Web Scraping: Extracting content from websites involves interacting with the page's HTML and JavaScript. We typically retrieve the content by targeting the CSS selectors.
  • PDF Generation: Converting web pages into PDFs programmatically is ideal when you want to directly generate a PDF from a web page, rather than taking a screenshot and then converting the screenshot to a PDF. (P.S. Apologies if you already have workarounds for this).
  • Automated Testing: Running tests on web pages by simulating user actions like clicking buttons, filling out forms, and taking screenshots. This eliminates the tedious process of manually going through long forms to ensure everything is in place.

?How to get started with puppetter?

Firstly we have to install the library,go ahead and do this.
Using npm:

npm i puppeteer # Downloads compatible Chrome during installation.
npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.

Using yarn:

yarn add puppeteer // Downloads compatible Chrome during installation.
yarn add puppeteer-core // Alternatively, install as a library, without downloading Chrome.

Using pnpm:

pnpm add puppeteer # Downloads compatible Chrome during installation.
pnpm add puppeteer-core # Alternatively, install as a library, without downloading Chrome.

? Example to demonstrate the use of puppeteer

Here is an example of how to scrape a website. (P.S. I used this code to retrieve products from the Myntra website for my e-commerce project.)

const puppeteer = require("puppeteer");
const CategorySchema = require("./models/Category");

// Define the scrape function as a named async function
const scrape = async () => {
    // Launch a new browser instance
    const browser = await puppeteer.launch({ headless: false });

    // Open a new page
    const page = await browser.newPage();

    // Navigate to the target URL and wait until the DOM is fully loaded
    await page.goto('https://www.myntra.com/mens-sport-wear?rawQuery=mens%20sport%20wear', { waitUntil: 'domcontentloaded' });

    // Wait for additional time to ensure all content is loaded
    await new Promise((resolve) => setTimeout(resolve, 25000));

    // Extract product details from the page
    const items = await page.evaluate(() => {
        // Select all product elements
        const elements = document.querySelectorAll('.product-base');
        const elementsArray = Array.from(elements);

        // Map each element to an object with the desired properties
        const results = elementsArray.map((element) => {
            const image = element.querySelector(".product-imageSliderContainer img")?.getAttribute("src");
            return {
                image: image ?? null,
                brand: element.querySelector(".product-brand")?.textContent,
                title: element.querySelector(".product-product")?.textContent,
                discountPrice: element.querySelector(".product-price .product-discountedPrice")?.textContent,
                actualPrice: element.querySelector(".product-price .product-strike")?.textContent,
                discountPercentage: element.querySelector(".product-price .product-discountPercentage")?.textContent?.split(' ')[0]?.slice(1, -1),
                total: 20, // Placeholder value, adjust as needed
                available: 10, // Placeholder value, adjust as needed
                ratings: Math.round((Math.random() * 5) * 10) / 10 // Random rating for demonstration
            };
        });

        return results; // Return the list of product details
    });

    // Close the browser
    await browser.close();

    // Prepare the data for saving
    const data = {
        category: "mens-sport-wear",
        subcategory: "Mens",
        list: items
    };

    // Create a new Category document and save it to the database
    // Since we want to store product information in our e-commerce store, we use a schema and save it to the database.
    // If you don't need to save the data, you can omit this step.
    const category = new CategorySchema(data);
    console.log(category);
    await category.save();

    // Return the scraped items
    return items;
};

// Export the scrape function as the default export
module.exports = scrape;

?Explanation:

  • In this code, we are using Puppeteer to scrape product data from a website. After extracting the details, we create a schema (CategorySchema) to structure and save this data into our database. This step is particularly useful if we want to integrate the scraped products into our e-commerce store. If storing the data in a database is not required, you can omit the schema-related code.
  • Before scraping, it's important to understand the HTML structure of the page and identify which CSS selectors contain the content you want to extract.
  • In my case, I used the relevant CSS selectors identified on the Myntra website to extract the content I was targeting.

The above is the detailed content of Web Scraping Made Easy: Parse Any HTML Page with Puppeteer. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Replace String Characters in JavaScriptReplace String Characters in JavaScriptMar 11, 2025 am 12:07 AM

Detailed explanation of JavaScript string replacement method and FAQ This article will explore two ways to replace string characters in JavaScript: internal JavaScript code and internal HTML for web pages. Replace string inside JavaScript code The most direct way is to use the replace() method: str = str.replace("find","replace"); This method replaces only the first match. To replace all matches, use a regular expression and add the global flag g: str = str.replace(/fi

Custom Google Search API Setup TutorialCustom Google Search API Setup TutorialMar 04, 2025 am 01:06 AM

This tutorial shows you how to integrate a custom Google Search API into your blog or website, offering a more refined search experience than standard WordPress theme search functions. It's surprisingly easy! You'll be able to restrict searches to y

8 Stunning jQuery Page Layout Plugins8 Stunning jQuery Page Layout PluginsMar 06, 2025 am 12:48 AM

Leverage jQuery for Effortless Web Page Layouts: 8 Essential Plugins jQuery simplifies web page layout significantly. This article highlights eight powerful jQuery plugins that streamline the process, particularly useful for manual website creation

Build Your Own AJAX Web ApplicationsBuild Your Own AJAX Web ApplicationsMar 09, 2025 am 12:11 AM

So here you are, ready to learn all about this thing called AJAX. But, what exactly is it? The term AJAX refers to a loose grouping of technologies that are used to create dynamic, interactive web content. The term AJAX, originally coined by Jesse J

What is 'this' in JavaScript?What is 'this' in JavaScript?Mar 04, 2025 am 01:15 AM

Core points This in JavaScript usually refers to an object that "owns" the method, but it depends on how the function is called. When there is no current object, this refers to the global object. In a web browser, it is represented by window. When calling a function, this maintains the global object; but when calling an object constructor or any of its methods, this refers to an instance of the object. You can change the context of this using methods such as call(), apply(), and bind(). These methods call the function using the given this value and parameters. JavaScript is an excellent programming language. A few years ago, this sentence was

Improve Your jQuery Knowledge with the Source ViewerImprove Your jQuery Knowledge with the Source ViewerMar 05, 2025 am 12:54 AM

jQuery is a great JavaScript framework. However, as with any library, sometimes it’s necessary to get under the hood to discover what’s going on. Perhaps it’s because you’re tracing a bug or are just curious about how jQuery achieves a particular UI

10 Mobile Cheat Sheets for Mobile Development10 Mobile Cheat Sheets for Mobile DevelopmentMar 05, 2025 am 12:43 AM

This post compiles helpful cheat sheets, reference guides, quick recipes, and code snippets for Android, Blackberry, and iPhone app development. No developer should be without them! Touch Gesture Reference Guide (PDF) A valuable resource for desig

How do I create and publish my own JavaScript libraries?How do I create and publish my own JavaScript libraries?Mar 18, 2025 pm 03:12 PM

Article discusses creating, publishing, and maintaining JavaScript libraries, focusing on planning, development, testing, documentation, and promotion strategies.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
Repo: How To Revive Teammates
1 months agoBy尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

MantisBT

MantisBT

Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

SublimeText3 English version

SublimeText3 English version

Recommended: Win version, supports code prompts!

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)