
In-depth use of Scrapy: How to crawl HTML, XML, and JSON data?


Scrapy is a powerful Python crawler framework that helps us fetch data from the Internet quickly and flexibly. In real crawling work, we often encounter data in formats such as HTML, XML, and JSON. In this article, we will introduce how to use Scrapy to crawl each of these three data formats.

1. Crawl HTML data

  1. Create a Scrapy project

First, we need to create a Scrapy project. Open the command line and enter the following command:

scrapy startproject myproject

This command will create a Scrapy project called myproject in the current folder.
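The generated project layout looks roughly like this (details vary slightly between Scrapy versions):

myproject/
    scrapy.cfg            # deployment configuration
    myproject/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code lives here
            __init__.py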

  2. Set the starting URL

Next, we need to set the starting URL. In the myproject/spiders directory, create a file named spider.py, edit the file, and enter the following code:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        pass

The code first imports the Scrapy library and then defines a spider class MySpider. It sets the spider name to myspider and the starting URL to http://example.com. Finally, it defines a parse method, which Scrapy calls by default to process the response data.
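The parse method is just the default callback. If we need to follow links from the page, we can yield further requests with callbacks of our own. Below is a minimal sketch; the parse_detail callback is a hypothetical name used for illustration, not part of the original spider:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Follow every link on the page and hand each
        # target page to a dedicated callback.
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Record the URL of each followed page.
        yield {'url': response.url}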

  3. Parse the response data

Next, we need to parse the response data. Continue to edit the myproject/spiders/spider.py file and add the following code:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        title = response.xpath('//title/text()').get()
        yield {'title': title}

In the code, we use the response.xpath() method to extract the title of the HTML page and yield it as a dictionary.
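Scrapy also supports CSS selectors, which are often easier to read for simple HTML queries. The following sketch extracts the same title with response.css(); getall() returns all matches as a list instead of only the first:

    def parse(self, response):
        # CSS equivalent of the XPath query above
        title = response.css('title::text').get()
        # getall() collects every match, here all paragraph texts
        paragraphs = response.css('p::text').getall()
        yield {'title': title, 'paragraphs': paragraphs}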

  4. Run the crawler

Finally, we need to run the Scrapy crawler. Enter the following command on the command line:

scrapy crawl myspider -o output.json

This command will output the data to the output.json file.
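Assuming the spider runs against http://example.com, whose page title is "Example Domain", output.json should contain something like:

[
    {"title": "Example Domain"}
]

Note that -o appends to an existing file; recent Scrapy versions (2.x) also accept a capital -O to overwrite the file instead.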

2. Crawl XML data

  1. Create a Scrapy project

Similarly, we first need to create a Scrapy project. Open the command line and enter the following command:

scrapy startproject myproject

This command will create a Scrapy project called myproject in the current folder.

  2. Set the starting URL

In the myproject/spiders directory, create a file named spider.py, edit the file, and enter the following code:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com/xml']

    def parse(self, response):
        pass

In the code, we set the spider name to myspider and the starting URL to http://example.com/xml.

  3. Parse the response data

Continue to edit the myproject/spiders/spider.py file and add the following code:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com/xml']

    def parse(self, response):
        for item in response.xpath('//item'):
            yield {
                'title': item.xpath('title/text()').get(),
                'link': item.xpath('link/text()').get(),
                'desc': item.xpath('desc/text()').get(),
            }

In the code, we use the response.xpath() method to extract data from the XML document. A for loop iterates over each item tag, extracts the text of its title, link, and desc child tags, and yields the result as a dictionary.
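Note that the spider assumes an RSS-like document whose tag names match the XPath expressions; the layout below is a hypothetical example, and the expressions must be adapted to the actual feed:

<items>
    <item>
        <title>First item</title>
        <link>http://example.com/1</link>
        <desc>Description of the first item</desc>
    </item>
</items>

For feed-style XML, Scrapy also ships a dedicated XMLFeedSpider that iterates over the nodes for you. A minimal sketch of the same extraction:

from scrapy.spiders import XMLFeedSpider

class MyXMLSpider(XMLFeedSpider):
    name = 'myxmlspider'
    start_urls = ['http://example.com/xml']
    itertag = 'item'  # iterate over each <item> node

    def parse_node(self, response, node):
        # node is a selector scoped to one <item>
        yield {
            'title': node.xpath('title/text()').get(),
            'link': node.xpath('link/text()').get(),
            'desc': node.xpath('desc/text()').get(),
        }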

  4. Run the crawler

Finally, we also need to run the Scrapy crawler. Enter the following command on the command line:

scrapy crawl myspider -o output.json

This command will output the data to the output.json file.

3. Crawl JSON data

  1. Create a Scrapy project

Similarly, we need to create a Scrapy project. Open the command line and enter the following command:

scrapy startproject myproject

This command will create a Scrapy project called myproject in the current folder.

  2. Set the starting URL

In the myproject/spiders directory, create a file named spider.py, edit the file, and enter the following code:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com/json']

    def parse(self, response):
        pass

In the code, we set the spider name to myspider and the starting URL to http://example.com/json.

  3. Parse the response data

Continue to edit the myproject/spiders/spider.py file and add the following code:

import scrapy
import json

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com/json']

    def parse(self, response):
        data = json.loads(response.body)
        for item in data['items']:
            yield {
                'title': item['title'],
                'link': item['link'],
                'desc': item['desc'],
            }

In the code, we use the json.loads() method to parse the JSON response body. A for loop iterates over the items array, reads each item's title, link, and desc fields, and yields the result as a dictionary.
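If you are on Scrapy 2.2 or later, the response object offers a json() helper that parses the body for you, so the explicit json.loads call and the json import can be dropped. A sketch:

    def parse(self, response):
        # response.json() parses the JSON body (Scrapy >= 2.2)
        data = response.json()
        for item in data['items']:
            yield {
                'title': item['title'],
                'link': item['link'],
                'desc': item['desc'],
            }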

  4. Run the crawler

Finally, we need to run the Scrapy crawler once more. Enter the following command on the command line:

scrapy crawl myspider -o output.json

This command will output the data to the output.json file.
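The .json extension is what tells Scrapy's feed exporter which format to write; the same spider can export CSV, XML, or JSON Lines simply by changing the file name, for example:

scrapy crawl myspider -o output.csv
scrapy crawl myspider -o output.xml
scrapy crawl myspider -o output.jl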

4. Summary

In this article, we introduced how to use Scrapy to crawl HTML, XML, and JSON data. Through the examples above, you can grasp the basic usage of Scrapy and then explore its more advanced features as needed. We hope this helps you on your way with crawler technology.
