Scrape the web with puppeteer!

WBOY (Original)
2024-08-29 11:06:52, 822 views

Puppeteer full guide pt.1

Puppeteer: The Power Tool for Web Automation

In today's fast-paced web development landscape, automation is key—and that's where Puppeteer comes in. Developed by Google, Puppeteer is a powerful Node.js library that allows developers to control Chrome browsers using JavaScript. Whether you're navigating the web in headless mode for efficiency or in a full browser for visual feedback, Puppeteer makes it easier than ever to automate tasks like web scraping, testing, and more. With Puppeteer, what once required manual effort is now just a script away.

Why web scraping?

In a recent project, I worked with a client who needed a landing page for his forex trading community. He wanted something similar to the stock tickers you see on MarketWatch or Yahoo Finance, but instead of stocks, he wanted real-time currency conversion rates for $1 USD displayed across the site.

While there are APIs available that could provide the data—with usage limits and monthly fees—I saw an opportunity to create a custom solution using Puppeteer. By investing some time upfront, I was able to scrape and display the data for free, ultimately saving my client from recurring costs.

Client's website: Majesticpips.com

Setting up Puppeteer made simple

Before we can start scraping the web in all its glory, we must install Puppeteer in our application.

The steps below follow the official docs.

Step 1

Install the library using your choice of npm, yarn, or pnpm.

  • npm i puppeteer

  • yarn add puppeteer

  • pnpm add puppeteer

This downloads a compatible version of Chrome during installation, which makes it easier for beginners to get things up and running quickly.

If you are a more seasoned developer and have a specific Chrome/Chromium version you'd like to work with, then installing one of these packages

  • npm i puppeteer-core

  • yarn add puppeteer-core

  • pnpm add puppeteer-core

would be best for you. The package is lightweight because it installs only Puppeteer and leaves the choice of Chrome version up to you.
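Because puppeteer-core does not download a browser, the launch call has to be told where one lives. Here is a minimal sketch, assuming the browser path is supplied through an environment variable named CHROME_PATH (a hypothetical name chosen for this example). executablePath is the Puppeteer launch option that points at a browser binary.

```javascript
// Build launch options for puppeteer-core.
// CHROME_PATH is a hypothetical environment variable used only for illustration.
function coreLaunchOptions(env) {
  const executablePath = env.CHROME_PATH;
  if (!executablePath) {
    throw new Error('Set CHROME_PATH to the Chrome/Chromium binary you want to use');
  }
  return { executablePath };
}

// Usage (inside an async function, with puppeteer-core installed):
//   const puppeteer = require('puppeteer-core');
//   const browser = await puppeteer.launch(coreLaunchOptions(process.env));
```

Failing early with a clear message is a design choice for this sketch; it avoids a less obvious launch error later on.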

Installing 'puppeteer' is the better option for first-time users. It simplifies the setup and ensures you have a working browser, allowing you to focus on writing your scripts.

Step 2

Now, in your JS file, import Puppeteer. Applications that use the ES module system (ES6 standard) with Node.js 12 and above can use the import syntax.

import puppeteer from 'puppeteer'; // recommended
or
import puppeteer from 'puppeteer-core';

Alternatively, you can use the require syntax of the CommonJS module system, which is also compatible with older versions of Node.js.

const puppeteer = require('puppeteer');
or
const puppeteer = require('puppeteer-core');

Step 3

After importing Puppeteer, we can start writing the commands to perform web scraping. The code below shows what you'll need to use.

We launch the browser using these methods provided by the library.

const browser = await puppeteer.launch();

const page = await browser.newPage();

await browser.close();

puppeteer.launch() = This method launches a new browser instance.

browser.newPage() = This method creates a new page (or tab) within the browser instance.

browser.close() = This method closes the browser instance.

In puppeteer.launch(), we can pass arguments to customize the browser launch according to our preferences. We’ll cover this in more detail in part 2. However, by default, puppeteer.launch() has preset values, such as headless mode being set to true.
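As a small preview of those arguments, here is a hedged sketch of a helper that switches between a headless run and a visible one. headless and slowMo are Puppeteer launch options; the helper name and the 100 ms delay are assumptions made for this example.

```javascript
// Return launch options for either a headless run or a visible, slowed-down run.
function launchOptions(visible) {
  return visible
    ? { headless: false, slowMo: 100 } // show the browser window and slow each operation by 100 ms
    : { headless: true };              // same as the default described above
}

// Usage (inside an async function):
//   const browser = await puppeteer.launch(launchOptions(true));
```

A visible, slowed-down browser is handy while writing a scraper, since you can watch each step happen.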

Step 4

The browser has been launched, and we now have a page ready to surf the web. Let's navigate to the website where we'll scrape some data.

For this example, we'll be scraping data from a quotes website.

 await page.goto('https://quotes.toscrape.com/');

 await page.screenshot({ path: 'screenshot.png' });

I've added await page.screenshot({ path: 'screenshot.png' }) to the mix. This is a great tool to ensure everything is going according to plan. When this code executes, you'll have an image file in your project directory capturing the current state of the website you're scraping. You can also adjust the file name to your liking.
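The navigate-then-screenshot check can be wrapped in a small helper. This is a sketch rather than part of the original script: the function name is hypothetical, while waitUntil: 'networkidle2' (wait until the network is mostly quiet) and fullPage: true (capture the whole scrollable page) are options of page.goto and page.screenshot.

```javascript
// Navigate to a URL and save a screenshot for a quick visual check.
// `page` is a Puppeteer Page, e.g. the result of `await browser.newPage()`.
async function captureSnapshot(page, url, path) {
  // Treat navigation as finished once the network has been mostly idle.
  await page.goto(url, { waitUntil: 'networkidle2' });
  // fullPage captures the entire scrollable page, not just the visible viewport.
  await page.screenshot({ path, fullPage: true });
  return path;
}

// Usage (inside an async function):
//   await captureSnapshot(page, 'https://quotes.toscrape.com/', 'screenshot.png');
```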

If everything checks out, proceed to step 5.

Step 5

Now that our script is taking shape, let’s dive into the key part where we extract data from the web page. Here's how our script looks so far:

const puppeteer = require('puppeteer');

(async () => {
  // Launch a browser and open a new page (tab)
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the target site and save a screenshot as a sanity check
  await page.goto('https://quotes.toscrape.com/');
  await page.screenshot({ path: 'screenshot.png' });

  // Run this function inside the page and collect the quotes
  const quotesScraper = await page.evaluate(() => {
    const quotes = document.querySelectorAll(".quote");
    const quotesArray = [];

    for (const quote of quotes) {
      const texts = quote.querySelector(".text").innerText;
      const author = quote.querySelector(".author").innerText;

      quotesArray.push({
        quote: texts,
        author
      });
    }
    return quotesArray;
  });

  console.log(quotesScraper);

  await browser.close();
})();

To verify that the data was successfully scraped, we can run node "server-file-name" in the CLI, and the data will be displayed in the console using console.log(quotesScraper);.

[
  {
    quote: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
    author: 'Albert Einstein'
  },
  {
    quote: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
    author: 'J.K. Rowling'
  },
  {
    quote: '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
    author: 'Albert Einstein'
  },
  {
    quote: '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
    author: 'Jane Austen'
  },
  {
    quote: "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
    author: 'Marilyn Monroe'
  }
....
]
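Notice that each scraped quote still carries the curly quotation marks from the page. If you would rather store the bare text, a small clean-up step can run on the returned array. This is an optional sketch; the function name is hypothetical and not part of the original script.

```javascript
// Strip the curly quotation marks that wrap each scraped quote.
function cleanQuotes(records) {
  return records.map(({ quote, author }) => ({
    quote: quote.replace(/^“|”$/g, ''), // remove a leading “ and a trailing ”
    author: author.trim(),
  }));
}

const sample = [
  {
    quote: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
    author: 'J.K. Rowling',
  },
];
console.log(cleanQuotes(sample));
```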

await page.evaluate(() => { ... }): This is where the magic happens. The evaluate method allows us to run JavaScript code within the context of the page we're scraping. It's as if you're opening the browser's developer console and running the code directly on the page.

const quotes = document.querySelectorAll(".quote");: Here, we're selecting all elements on the page that match the .quote class. This gives us a NodeList of quote elements.

const quotesArray = [];: We initialize an empty array to store the quotes we extract.

for (const quote of quotes) { ... }: This loop iterates over each quote element. For each one, we extract the text of the quote and the author.

quotesArray.push({ quote: texts, author });: For each quote, we create an object containing the quote text and the author, then push this object into the quotesArray.

return quotesArray;: Finally, we return the array of quotes, which is then stored in quotesScraper in our Node.js environment.

This method of extracting data is powerful because it allows you to interact with the page just like a user would, but in an automated and programmatic way.
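The same extraction can be written more compactly with page.$$eval, which runs querySelectorAll in the page and hands the matched elements to a callback as an array. This is an alternative to the loop above, not the approach used in the script. The function name and the null fallbacks are assumptions added for illustration.

```javascript
// Collect { quote, author } objects from every .quote element on the page.
// `page` is a Puppeteer Page that has already navigated to the quotes site.
async function scrapeQuotes(page) {
  return page.$$eval('.quote', (elements) =>
    elements.map((el) => {
      const text = el.querySelector('.text');
      const author = el.querySelector('.author');
      // Fall back to null so one malformed element does not throw.
      return {
        quote: text ? text.innerText : null,
        author: author ? author.innerText : null,
      };
    })
  );
}

// Usage (inside an async function):
//   const quotes = await scrapeQuotes(page);
```

Whatever the callback returns must be serializable (plain objects, arrays, strings, numbers), because it travels from the browser back to Node.js. DOM elements themselves cannot be returned this way, which is why both versions copy innerText into plain objects.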

Closing the Browser

await browser.close();: After scraping the data, it's important to close the browser to free up resources. This line ensures that the browser instance we launched is properly shut down.
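If the scraping code throws before reaching browser.close(), the browser process can be left running. One common way to guard against that is a try/finally wrapper. Here is a minimal sketch, assuming the Puppeteer module is passed in by the caller; the function name is hypothetical.

```javascript
// Run `task` with a fresh page and always close the browser afterwards.
// `puppeteer` is the imported Puppeteer module, passed in by the caller.
async function withPage(puppeteer, task) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    // Runs whether the task succeeded or threw.
    await browser.close();
  }
}

// Usage (inside an async function):
//   const title = await withPage(puppeteer, async (page) => {
//     await page.goto('https://quotes.toscrape.com/');
//     return page.title();
//   });
```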

Looking Ahead to Part 2

With this script, you've successfully scraped data from a website using Puppeteer. But we're just scratching the surface of what's possible. In Part 2, we'll explore more advanced techniques, such as handling dynamic content and using Express.js to expose the scraped data through an API. Stay tuned as we delve deeper into the world of Puppeteer!