Imagine building an e-commerce platform where we can easily fetch product data in real-time from major stores like eBay, Amazon, and Flipkart. Sure, there’s Shopify and similar services, but let's be honest—it can feel a bit cumbersome to buy a subscription just for a project. So, I thought, why not scrape these sites and store the products directly in our database? It would be an efficient and cost-effective way to get products for our e-commerce projects.
What is Web Scraping?
Web scraping involves extracting data from websites by parsing the HTML of web pages to read and collect content. It often involves automating a browser or sending HTTP requests to the site, and then analyzing the HTML structure to retrieve specific pieces of information like text, links, or images.Puppeteer is one library used to scrape the websites.
?What is Puppeteer?
Puppeteer is a Node.js library.It provides a high-level API for controlling headless Chrome or Chromium browsers.Headless Chrome is a version of chrome that runs everything without an UI(perfect for running things in the background).
We can automate various tasks using puppeteer,such as:
- Web Scraping: Extracting content from websites involves interacting with the page's HTML and JavaScript. We typically retrieve the content by targeting the CSS selectors.
- PDF Generation: Converting web pages into PDFs programmatically is ideal when you want to directly generate a PDF from a web page, rather than taking a screenshot and then converting the screenshot to a PDF. (P.S. Apologies if you already have workarounds for this).
- Automated Testing: Running tests on web pages by simulating user actions like clicking buttons, filling out forms, and taking screenshots. This eliminates the tedious process of manually going through long forms to ensure everything is in place.
?How to get started with puppetter?
Firstly we have to install the library,go ahead and do this.
Using npm:
npm i puppeteer # Downloads compatible Chrome during installation. npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.
Using yarn:
yarn add puppeteer // Downloads compatible Chrome during installation. yarn add puppeteer-core // Alternatively, install as a library, without downloading Chrome.
Using pnpm:
pnpm add puppeteer # Downloads compatible Chrome during installation. pnpm add puppeteer-core # Alternatively, install as a library, without downloading Chrome.
? Example to demonstrate the use of puppeteer
Here is an example of how to scrape a website. (P.S. I used this code to retrieve products from the Myntra website for my e-commerce project.)
const puppeteer = require("puppeteer"); const CategorySchema = require("./models/Category"); // Define the scrape function as a named async function const scrape = async () => { // Launch a new browser instance const browser = await puppeteer.launch({ headless: false }); // Open a new page const page = await browser.newPage(); // Navigate to the target URL and wait until the DOM is fully loaded await page.goto('https://www.myntra.com/mens-sport-wear?rawQuery=mens%20sport%20wear', { waitUntil: 'domcontentloaded' }); // Wait for additional time to ensure all content is loaded await new Promise((resolve) => setTimeout(resolve, 25000)); // Extract product details from the page const items = await page.evaluate(() => { // Select all product elements const elements = document.querySelectorAll('.product-base'); const elementsArray = Array.from(elements); // Map each element to an object with the desired properties const results = elementsArray.map((element) => { const image = element.querySelector(".product-imageSliderContainer img")?.getAttribute("src"); return { image: image ?? null, brand: element.querySelector(".product-brand")?.textContent, title: element.querySelector(".product-product")?.textContent, discountPrice: element.querySelector(".product-price .product-discountedPrice")?.textContent, actualPrice: element.querySelector(".product-price .product-strike")?.textContent, discountPercentage: element.querySelector(".product-price .product-discountPercentage")?.textContent?.split(' ')[0]?.slice(1, -1), total: 20, // Placeholder value, adjust as needed available: 10, // Placeholder value, adjust as needed ratings: Math.round((Math.random() * 5) * 10) / 10 // Random rating for demonstration }; }); return results; // Return the list of product details }); // Close the browser await browser.close(); // Prepare the data for saving const data = { category: "mens-sport-wear", subcategory: "Mens", list: items }; // Create a new Category document and save it to the database // Since we want to store product information in our e-commerce store, we use a schema and save it to the database. // If you don't need to save the data, you can omit this step. const category = new CategorySchema(data); console.log(category); await category.save(); // Return the scraped items return items; }; // Export the scrape function as the default export module.exports = scrape;
?Explanation:
- In this code, we are using Puppeteer to scrape product data from a website. After extracting the details, we create a schema (CategorySchema) to structure and save this data into our database. This step is particularly useful if we want to integrate the scraped products into our e-commerce store. If storing the data in a database is not required, you can omit the schema-related code.
- Before scraping, it's important to understand the HTML structure of the page and identify which CSS selectors contain the content you want to extract.
- In my case, I used the relevant CSS selectors identified on the Myntra website to extract the content I was targeting.
The above is the detailed content of Web Scraping Made Easy: Parse Any HTML Page with Puppeteer. For more information, please follow other related articles on the PHP Chinese website!

The power of the JavaScript framework lies in simplifying development, improving user experience and application performance. When choosing a framework, consider: 1. Project size and complexity, 2. Team experience, 3. Ecosystem and community support.

Introduction I know you may find it strange, what exactly does JavaScript, C and browser have to do? They seem to be unrelated, but in fact, they play a very important role in modern web development. Today we will discuss the close connection between these three. Through this article, you will learn how JavaScript runs in the browser, the role of C in the browser engine, and how they work together to drive rendering and interaction of web pages. We all know the relationship between JavaScript and browser. JavaScript is the core language of front-end development. It runs directly in the browser, making web pages vivid and interesting. Have you ever wondered why JavaScr

Node.js excels at efficient I/O, largely thanks to streams. Streams process data incrementally, avoiding memory overload—ideal for large files, network tasks, and real-time applications. Combining streams with TypeScript's type safety creates a powe

The differences in performance and efficiency between Python and JavaScript are mainly reflected in: 1) As an interpreted language, Python runs slowly but has high development efficiency and is suitable for rapid prototype development; 2) JavaScript is limited to single thread in the browser, but multi-threading and asynchronous I/O can be used to improve performance in Node.js, and both have advantages in actual projects.

JavaScript originated in 1995 and was created by Brandon Ike, and realized the language into C. 1.C language provides high performance and system-level programming capabilities for JavaScript. 2. JavaScript's memory management and performance optimization rely on C language. 3. The cross-platform feature of C language helps JavaScript run efficiently on different operating systems.

JavaScript runs in browsers and Node.js environments and relies on the JavaScript engine to parse and execute code. 1) Generate abstract syntax tree (AST) in the parsing stage; 2) convert AST into bytecode or machine code in the compilation stage; 3) execute the compiled code in the execution stage.

The future trends of Python and JavaScript include: 1. Python will consolidate its position in the fields of scientific computing and AI, 2. JavaScript will promote the development of web technology, 3. Cross-platform development will become a hot topic, and 4. Performance optimization will be the focus. Both will continue to expand application scenarios in their respective fields and make more breakthroughs in performance.

Both Python and JavaScript's choices in development environments are important. 1) Python's development environment includes PyCharm, JupyterNotebook and Anaconda, which are suitable for data science and rapid prototyping. 2) The development environment of JavaScript includes Node.js, VSCode and Webpack, which are suitable for front-end and back-end development. Choosing the right tools according to project needs can improve development efficiency and project success rate.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

MinGW - Minimalist GNU for Windows
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

SecLists
SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

Notepad++7.3.1
Easy-to-use and free code editor

ZendStudio 13.5.1 Mac
Powerful PHP integrated development environment
