


Scraping Infinite Scroll Pages with a Load More Button: A Step-by-Step Guide
Are your scrapers stuck when trying to load data from dynamic web pages? Are you frustrated with infinite scrolls or those pesky "Load more" buttons?
You're not alone. Many websites today implement these designs to improve user experience—but they can be challenging for web scrapers.
This tutorial will guide you through a beginner-friendly walkthrough for scraping a demo page that hides part of its product list behind a "Load more" button.
By the end, you'll learn how to:
- Set up Selenium for web scraping.
- Automate the "Load more" button interaction.
- Extract product data such as names, prices, and links.
Let's dive in!
Step 1: Prerequisites
Before diving in, make sure you have the following prerequisites in place:
- Python Installed: Download and install the latest Python version from python.org, including pip during setup.
- Basic Knowledge: Familiarity with web scraping concepts, Python programming, and working with libraries such as requests, BeautifulSoup, and Selenium.
Libraries Required:
- Requests: For sending HTTP requests.
- BeautifulSoup: For parsing the HTML content.
- Selenium: For simulating user interactions like button clicks in a browser.
You can install these libraries using the following command in your terminal:
pip install requests beautifulsoup4 selenium
Before using Selenium, you must install a web driver matching your browser. For this tutorial, we'll use Google Chrome and ChromeDriver. However, you can follow similar steps for other browsers like Firefox or Edge.
Install the Web Driver
- Check your browser version: Open Google Chrome and navigate to Help > About Google Chrome from the three-dot menu to find your Chrome version.
- Download ChromeDriver: Visit the ChromeDriver download page and download the driver version that matches your Chrome version.
- Add ChromeDriver to your system PATH: Extract the downloaded file and place it in a directory like /usr/local/bin (Mac/Linux) or C:\Windows\System32 (Windows).
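Once the file is in place, you can confirm the driver is reachable from your PATH by printing its version in a new terminal window:
chromedriver --version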
Verify Installation
Create a Python file named scraper.py in your project directory and test that everything is set up correctly by running the following code snippet:
from selenium import webdriver

driver = webdriver.Chrome()  # Ensure ChromeDriver is installed and in PATH
driver.get("https://www.scrapingcourse.com/button-click")
print(driver.title)
driver.quit()
You can execute the file by running the following command in your terminal:
python scraper.py
If the above code runs without errors, it will spin up a browser window and open the demo page URL.
Selenium will then extract the HTML and print the page title. You will see an output like this:
Load More Button Challenge to Learn Web Scraping - ScrapingCourse.com
This verifies that Selenium is ready to use. With all requirements installed and ready to use, you can start accessing the demo page's content.
Step 2: Get Access to the Content
The first step is to fetch the page's initial content, which gives you a baseline snapshot of the page's HTML. This will help you verify connectivity and ensure a valid starting point for the scraping process.
You will retrieve the HTML content of the page URL by sending a GET request using the Requests library in Python. Here's the code:
import requests

# URL of the demo page with products
url = "https://www.scrapingcourse.com/button-click"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text
    print(html_content)  # Optional: Preview the HTML
else:
    print(f"Failed to retrieve content: {response.status_code}")
The above code will output the raw HTML containing the data for the first 12 products.
This quick preview of the HTML ensures that the request was successful and that you're working with valid data.
Step 3: Load More Products
To access the remaining products, you'll need to programmatically click the "Load more" button on the page until no more products are available. Since this interaction involves JavaScript, you will use Selenium to simulate the button click.
Before writing code, let’s inspect the page to locate:
- The "Load more" button selector (load-more-btn).
- The div holding the product details (product-item).
Running the following code clicks the button repeatedly until every product has loaded, giving you the full dataset:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Set up the WebDriver (make sure you have the appropriate driver installed, e.g., ChromeDriver)
driver = webdriver.Chrome()

# Open the page
driver.get("https://www.scrapingcourse.com/button-click")

# Loop to click the "Load more" button until there are no more products
while True:
    try:
        # Find the "Load more" button by its ID and click it
        load_more_button = driver.find_element(By.ID, "load-more-btn")
        load_more_button.click()

        # Wait for the content to load (adjust time as necessary)
        time.sleep(2)
    except Exception:
        # If no "Load more" button is found (end of products), break out of the loop
        print("No more products to load.")
        break

# Get the updated page content after all products are loaded
html_content = driver.page_source

# Close the browser window
driver.quit()
This code opens the browser, navigates to the page, and interacts with the "Load more" button. The updated HTML, now containing more product data, is then extracted.
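One refinement worth considering: time.sleep(2) is a fixed delay and can be flaky on slow connections. Selenium's explicit waits are a more reliable alternative; here is a minimal sketch that replaces the click-and-sleep lines above with WebDriverWait:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the button to become clickable, then click it
wait = WebDriverWait(driver, 10)
load_more_button = wait.until(EC.element_to_be_clickable((By.ID, "load-more-btn")))
load_more_button.click()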
If you don’t want Selenium to open the browser every time you run this code, it also provides headless browser capabilities. A headless browser has all the functionalities of an actual web browser but no Graphical User Interface (GUI).
You can enable the headless mode for Chrome in Selenium by defining a ChromeOptions object and passing it to the WebDriver Chrome constructor like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# instantiate a Chrome options object
options = webdriver.ChromeOptions()

# set the options to use Chrome in headless mode
options.add_argument("--headless=new")

# initialize an instance of the Chrome driver (browser) in headless mode
driver = webdriver.Chrome(options=options)

...
When you run the above code, Selenium will launch a headless Chrome instance, so you’ll no longer see a Chrome window. This is ideal for production environments where you don’t want to waste resources on the GUI when running the scraping script on a server.
Now that the complete HTML content has been retrieved, it's time to extract specific details about each product.
Step 4: Parse Product Information
In this step, you'll use BeautifulSoup to parse the HTML and identify product elements. Then, you'll extract key details for each product, such as the name, price, and links.
from bs4 import BeautifulSoup

# Parse the page content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Extract product details
products = []

# Find all product items in the grid
product_items = soup.find_all('div', class_='product-item')

for product in product_items:
    # Extract the product name
    name = product.find('span', class_='product-name').get_text(strip=True)

    # Extract the product price
    price = product.find('span', class_='product-price').get_text(strip=True)

    # Extract the product link
    link = product.find('a')['href']

    # Extract the image URL
    image_url = product.find('img')['src']

    # Create a dictionary with the product details
    products.append({
        'name': name,
        'price': price,
        'link': link,
        'image_url': image_url
    })

# Print the extracted product details
for product in products[:2]:
    print(f"Name: {product['name']}")
    print(f"Price: {product['price']}")
    print(f"Link: {product['link']}")
    print(f"Image URL: {product['image_url']}")
    print('-' * 30)
In the output, you should see a structured list of product details, including the name, image URL, price, and product page link, like this:
Name: Chaz Kangeroo Hoodie
Price:
Link: https://scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie
Image URL: https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg
------------------------------
Name: Teton Pullover Hoodie
Price:
Link: https://scrapingcourse.com/ecommerce/product/teton-pullover-hoodie
Image URL: https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg
------------------------------
…
The above code will organize the raw HTML data into a structured format, making it easier to work with and preparing the output data for further processing.
Step 5: Export Product Information to CSV
You can now organize the extracted data into a CSV file, which makes it easier to analyze or share. Python's CSV module helps with this.
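Here's a sketch of the export step; it assumes the products list built in Step 4 and reuses its dictionary keys as the CSV columns:
import csv

# Write the extracted product details to products.csv
with open('products.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['name', 'price', 'link', 'image_url'])
    writer.writeheader()        # Column headers
    writer.writerows(products)  # One row per product dictionary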
The above code will create a new CSV file with all the required product details.
Here's the complete code for an overview:
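The listing below is a consolidated sketch assembled from the snippets above, using the same selectors and the CSV layout from Step 5:
import csv
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start Chrome in headless mode
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# Open the demo page
driver.get("https://www.scrapingcourse.com/button-click")

# Click the "Load more" button until it is no longer present
while True:
    try:
        load_more_button = driver.find_element(By.ID, "load-more-btn")
        load_more_button.click()
        time.sleep(2)  # Wait for the new products to render
    except Exception:
        print("No more products to load.")
        break

# Grab the fully loaded HTML and close the browser
html_content = driver.page_source
driver.quit()

# Parse the product grid with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
products = []
for product in soup.find_all('div', class_='product-item'):
    products.append({
        'name': product.find('span', class_='product-name').get_text(strip=True),
        'price': product.find('span', class_='product-price').get_text(strip=True),
        'link': product.find('a')['href'],
        'image_url': product.find('img')['src'],
    })

# Export everything to products.csv
with open('products.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['name', 'price', 'link', 'image_url'])
    writer.writeheader()
    writer.writerows(products)

print(f"Saved {len(products)} products to products.csv")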
The above code will create a products.csv file with a header row followed by one row per product.
Step 6: Get Extra Data for Top Products
Now, let's say you want to identify the top 5 highest-priced products and extract additional data (such as the product description and SKU code) from their individual pages. You can do that with code like the following:
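Here's a sketch of that step. It assumes the products list from Step 4, and it guesses that the product pages mark up the description and SKU with product-description and product-sku classes; inspect a product page and adjust those selectors if the live markup differs:
import csv

import requests
from bs4 import BeautifulSoup

def parse_price(price_text):
    # Convert a price string such as "$52" to a float
    # (assumes a leading currency symbol and optional thousands separators)
    cleaned = price_text.replace('$', '').replace(',', '').strip()
    return float(cleaned) if cleaned else 0.0

# Sort the products by price in descending order and keep the top 5
top_products = sorted(products, key=lambda p: parse_price(p['price']), reverse=True)[:5]

for product in top_products:
    # Open the individual product page
    response = requests.get(product['link'])
    page_soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the description and SKU (class names are assumptions)
    description_tag = page_soup.find(class_='product-description')
    sku_tag = page_soup.find(class_='product-sku')
    product['description'] = description_tag.get_text(strip=True) if description_tag else ''
    product['sku'] = sku_tag.get_text(strip=True) if sku_tag else ''

    print(f"{product['name']}: SKU={product['sku']}")
    print(product['description'])
    print('-' * 30)

# Rewrite products.csv with the two extra columns; products that were not
# enriched get empty cells (csv.DictWriter fills missing keys with restval)
fieldnames = ['name', 'price', 'link', 'image_url', 'description', 'sku']
with open('products.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(products)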
Here's the complete code for an overview:
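The listing below is a consolidated sketch under the same assumptions; the product-page selectors remain guesses to verify against the live site:
import csv
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

def parse_price(price_text):
    # Convert a price string such as "$52" to a float
    cleaned = price_text.replace('$', '').replace(',', '').strip()
    return float(cleaned) if cleaned else 0.0

# Start Chrome in headless mode and open the demo page
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://www.scrapingcourse.com/button-click")

# Click "Load more" until no button is left
while True:
    try:
        driver.find_element(By.ID, "load-more-btn").click()
        time.sleep(2)
    except Exception:
        break

html_content = driver.page_source
driver.quit()

# Parse all loaded products
soup = BeautifulSoup(html_content, 'html.parser')
products = []
for product in soup.find_all('div', class_='product-item'):
    products.append({
        'name': product.find('span', class_='product-name').get_text(strip=True),
        'price': product.find('span', class_='product-price').get_text(strip=True),
        'link': product.find('a')['href'],
        'image_url': product.find('img')['src'],
    })

# Enrich the top 5 highest-priced products with description and SKU
top_products = sorted(products, key=lambda p: parse_price(p['price']), reverse=True)[:5]
for product in top_products:
    response = requests.get(product['link'])
    page_soup = BeautifulSoup(response.text, 'html.parser')
    description_tag = page_soup.find(class_='product-description')  # assumed selector
    sku_tag = page_soup.find(class_='product-sku')                  # assumed selector
    product['description'] = description_tag.get_text(strip=True) if description_tag else ''
    product['sku'] = sku_tag.get_text(strip=True) if sku_tag else ''

# Write the final CSV
fieldnames = ['name', 'price', 'link', 'image_url', 'description', 'sku']
with open('products.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(products)

print(f"Saved {len(products)} products to products.csv")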
This code sorts the products by price in descending order. Then, for the top 5 highest-priced products, the script opens their product pages and extracts the product description and SKU using BeautifulSoup.
When you run it, the script prints the description and SKU for each of the top five products and rewrites products.csv with two extra columns, description and sku.
Conclusion
Scraping pages with infinite scrolling or "Load more" buttons can seem challenging, but using tools like Requests, Selenium, and BeautifulSoup simplifies the process.
This tutorial showed how to retrieve and process product data from a demo page, saving it in a structured format for quick and easy access.
See all the code snippets here.