
How to use Scrapy to crawl JD merchants' product data

PHPz | Original | 2023-06-23


Scrapy is a powerful Python web crawler framework that lets us write code to scrape web page data quickly and cleanly. This article introduces how to use Scrapy to crawl JD merchants' product data.

Preparation

Before we start writing code, we need to make some preparations.

1. Install Scrapy

We need to install Scrapy locally. If you have not installed Scrapy yet, you can enter the following command in the command line:

pip install Scrapy
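
If the installation succeeds, the following command prints the installed version:

scrapy version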

2. Create Scrapy Project

Open the terminal and enter the following command:

scrapy startproject JDspider

This command creates a Scrapy project named JDspider in the current folder.
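
The command generates the standard Scrapy project layout; the spiders/ directory is where our crawler code will live:

JDspider/
    scrapy.cfg            # deploy configuration file
    JDspider/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where Spiders are placed
            __init__.py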

3. Create Spider

In Scrapy, Spider is the core component for crawling data. We need to create a Spider to obtain the product data of JD merchants. Enter the following command on the command line:

cd JDspider
scrapy genspider JD jd.com

Here we use the scrapy genspider command to generate a Spider named JD, with jd.com as its allowed domain. The generated code is located in the JDspider/spiders/JD.py file. Now we need to edit this file to complete the crawler.
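
The generated file contains a minimal Spider skeleton roughly like the following (the exact template varies slightly by Scrapy version), which we will fill in below:

import scrapy


class JdSpider(scrapy.Spider):
    name = "JD"
    allowed_domains = ["jd.com"]
    start_urls = ["https://jd.com"]

    def parse(self, response):
        pass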

Analyze the target website

Before writing the code, we need to analyze the target website first. Let’s take https://mall.jd.com/index-1000000127.html as an example.

Open the Chrome browser, press the F12 key to open the developer tools, and then click the Network tab. After entering the URL of the target website, we can see the requests the page makes and their responses.

We can find that the page uses AJAX to load the product list. Filtering the Network tab by XHR requests, we can see the URL of the request and that it returns data in JSON format.

We can directly access this URL to obtain product information.
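
Before wiring this into Scrapy, it can help to inspect the JSON response in a plain Python session. The sketch below uses a placeholder URL standing in for whatever AJAX URL you found in the Network tab; the payload structure must be checked against the actual response:

import requests

# Placeholder: substitute the actual AJAX request URL copied from the Network tab
ajax_url = "https://mall.jd.com/example/ajax/endpoint"

# JD may reject requests without a browser-like User-Agent
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(ajax_url, headers=headers)

data = resp.json()
# Print the top-level keys to discover the structure of the payload
print(list(data.keys()))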

Get product data

Now that we know how to get the product information, we can add code to the Spider to complete this task.

First open the JDspider/spiders/JD.py file and find the definition of the Spider class. We need to modify this class to set its name, allowed domain, and initial URL.

import json

import scrapy


class JdSpider(scrapy.Spider):
    name = "JD"
    # Only requests within jd.com will be followed
    allowed_domains = ["jd.com"]
    # Starting URL; the parse() method below assumes the response body is JSON
    start_urls = [
        "https://pro.jd.com/mall/active/3W9j276jGAAFpgx5vds5msKg82gX/index.html"
    ]

Next, we fetch the data. In Scrapy, the parse() method handles the response for each request. Here we use the json module to parse the returned JSON data and extract the required information: each product's title, price, store name (stored as address), and sales count.

    def parse(self, response):
        # The response body is JSON; pull out the product list
        products = json.loads(response.body)['data']['productList']
        for product in products:
            # The key names here depend on the actual JSON returned by the site
            title = product['name']
            price = product['price']
            address = product['storeName']
            count = product['totalSellCount']
            yield {
                'title': title,
                'price': price,
                'address': address,
                'count': count,
            }

Now the scraping logic is complete. We can run this Spider and save the results to a file. Enter the following command in the terminal:

scrapy crawl JD -o products.json
  • JD is the name of the Spider we created;
  • -o is the output option, specifying where to save the crawled results;
  • products.json is the output file name; the results will be saved in this file.
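
After the crawl finishes, products.json will contain a JSON array with one object per product. With hypothetical values, it might look like this:

[
    {"title": "Example product A", "price": "99.00", "address": "Example Store", "count": 1024},
    {"title": "Example product B", "price": "59.00", "address": "Example Store", "count": 512}
]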

This is a simple example that demonstrates how to use Scrapy to crawl JD merchants' product data. In practical applications, we may need more complex processing, and Scrapy provides many powerful tools and modules for that, such as the item pipeline sketched below.
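
An item pipeline post-processes every scraped item before it is exported. The following is a minimal sketch, assuming the field names used above, that drops any product with no price:

from scrapy.exceptions import DropItem


class PriceFilterPipeline:
    """Discard scraped items that are missing a price."""

    def process_item(self, item, spider):
        # Each item here is the dict yielded by JdSpider.parse()
        if not item.get("price"):
            raise DropItem("Missing price in %r" % item)
        return item

To enable the pipeline, register it in JDspider/settings.py, for example: ITEM_PIPELINES = {"JDspider.pipelines.PriceFilterPipeline": 300}.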

