
How Scrapy parses HTML code

WBOY
2023-06-22 21:25:38

Scrapy is a Python web crawling framework for fetching pages and extracting data from them. Parsing HTML is a central part of that work. This article introduces how Scrapy parses HTML code, to help readers gain a deeper understanding of the framework.

1. How Scrapy parses HTML code
In Scrapy, there are two ways to select data from HTML code: XPath and CSS selectors. XPath is a query language for selecting nodes in an XML document, and it works equally well on parsed HTML. CSS selectors pick elements on the page using the same syntax that CSS stylesheets use. Which one to choose depends on the structure of the page and the type of data to extract.

2. Parsing HTML code with XPath
XPath is a common way to parse HTML code in Scrapy. You can evaluate XPath expressions with the lxml library directly or with the Selector class that ships with Scrapy. The examples below use Scrapy's Selector.

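First, we need to obtain the source code of the page, which can be fetched with Scrapy's Request class. Note that response.text holds the decoded page as a string, whereas response.body holds the raw bytes; the Selector used below expects a string.

from scrapy import Request

# Both functions are meant to be methods of a scrapy.Spider subclass.
def parse(self, response):
    # Schedule the request; Scrapy passes the downloaded page to parse_page
    yield Request(url='http://example.com', callback=self.parse_page)

def parse_page(self, response):
    # Decoded page source (str), suitable for Selector(text=...)
    html = response.text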

Next, we can parse the HTML code with the Selector class. First construct a Selector object.

from scrapy.selector import Selector

selector = Selector(text=html)

Then, we can use XPath expressions to select the required elements. Commonly used XPath expressions are as follows:

  1. Select elements
selector.xpath('//title')  # select all title elements
selector.xpath('//div[@class="example"]')  # select div elements whose class attribute is exactly "example"
selector.xpath('//div[contains(@class, "example") and @id="content"]')  # select div elements whose class contains "example" and whose id is "content"
  2. Select element attributes
selector.xpath('//a/@href')  # select the href attribute of every a element
  3. Select element text
selector.xpath('//h1/text()')  # select the text content of h1 elements
selector.xpath('//p[contains(text(), "example")]/text()')  # select the text of p elements whose text contains "example"

The above is how to use XPath in Scrapy.

3. Parsing HTML code with CSS selectors
CSS selectors are the other commonly used way to parse HTML code in Scrapy. Unlike XPath, they use the selector syntax of CSS stylesheets. The examples below again use the Selector class that comes with Scrapy.

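As before, we first need to obtain the source code of the page, which can be fetched with Scrapy's Request class. Use response.text (a decoded string) rather than response.body (raw bytes), because the Selector used below expects a string.

from scrapy import Request

# Both functions are meant to be methods of a scrapy.Spider subclass.
def parse(self, response):
    # Schedule the request; Scrapy passes the downloaded page to parse_page
    yield Request(url='http://example.com', callback=self.parse_page)

def parse_page(self, response):
    # Decoded page source (str), suitable for Selector(text=...)
    html = response.text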

Next, we can parse the HTML code with the Selector class. As before, construct a Selector object first.

from scrapy.selector import Selector

selector = Selector(text=html)

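Use CSS selector syntax to select elements. The ::text and ::attr() pseudo-elements are Scrapy extensions to standard CSS.

selector.css('title')  # select all title elements
selector.css('div.example')  # select div elements that have the class "example"
selector.css('div.example#content')  # select div elements that have the class "example" and the id "content"
selector.css('a::attr(href)')  # select the href attribute of every a element
selector.css('h1::text')  # select the text content of h1 elements
selector.css('p:contains("example")::text')  # select the text of p elements that contain the text "example"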

The above is how to use CSS Selector in Scrapy.

4. Summary
This article covered the two methods Scrapy offers for parsing HTML code: XPath and CSS selectors. With either one, we can select the data we need from a page. When choosing between them, pick the method and syntax that best fit the structure of the page and the type of data to extract.
