
Parsing HTML pages with Python

We usually need to parse crawled web pages to extract the data we need. By analyzing the nested structure of HTML tags, we can extract the useful information a page contains. In Python, there are three common ways to parse HTML: regular expression parsing, XPath parsing, and CSS selector parsing.

The structure of HTML page

Understanding the basic structure of an HTML page is a prerequisite for explaining the parsing methods. When we open a website in a browser and choose "View page source" from the right-click menu, we can see the HTML code of the page. HTML code consists of tags, attributes, and text: tags carry the content displayed on the page, attributes supplement tag information, and text is the content a tag displays. Below is a simple example of an HTML page structure:

<!DOCTYPE html>
<html>
    <head>
        <!-- Content inside the head tag is not displayed in the browser window -->
        <title>This is the page title</title>
    </head>
    <body>
        <!-- Content inside the body tag is displayed in the browser window -->
        <h2 id="main-heading">This is a heading</h2>
        <p>This is a paragraph of text</p>
    </body>
</html>

In this HTML page code example, <!DOCTYPE html> is the document type declaration, the <html> tag is the root tag of the entire page, and <head> and <body> are its child tags. The content under the <body> tag is displayed in the browser window and forms the main body of the web page; the content under the <head> tag is not displayed in the browser window, but it contains important meta-information about the page and is usually called the page header. The general code structure of an HTML page is as follows:

<!DOCTYPE html>
<html>
    <head>
        <!-- Page meta-information: character encoding, title, keywords, media queries, etc. -->
    </head>
    <body>
        <!-- Page body: the content displayed in the browser window -->
    </body>
</html>

HTML tags, cascading style sheets (CSS), and JavaScript are the three basic components of an HTML page. Tags carry the content to be displayed on the page, CSS is responsible for rendering the page, and JavaScript controls the page's interactive behavior. To parse HTML pages, you can use XPath syntax, which was originally a query syntax for XML; it extracts tag content or tag attributes based on the hierarchical structure of HTML tags. You can also locate page elements with CSS selectors, in the same way CSS locates the elements it styles.

XPath parsing

XPath is a syntax for finding information in XML (eXtensible Markup Language) documents. XML, like HTML, is a markup language that uses tags to carry data. The difference is that XML tags are extensible and customizable, and XML has stricter syntax requirements. XPath uses path expressions to select nodes or node sets in XML documents. Nodes here include elements, attributes, text, namespaces, processing instructions, comments, and the root node.

XPath path expressions are similar to file path syntax: you can use "/" and "//" to select nodes. A single slash "/" selects from the root node; a double slash "//" selects a node at any position. For example, "/bookstore/book" selects all book child nodes under the root node bookstore, and "//title" selects title nodes at any position.

XPath can also use predicates to filter nodes. A predicate is an expression nested in square brackets; it can be a number, a comparison, or a function call. For example, "/bookstore/book[1]" selects the first book child of bookstore, and "//book[@lang]" selects all book nodes that have a lang attribute.

XPath provides string, mathematical, logical, node, and sequence functions, among others. These functions can be used to select nodes, compute values, convert data types, and so on. For example, the "string-length(string)" function returns the length of a string, and the "count(node-set)" function returns the number of nodes in a node set.
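As a sketch of how these functions behave, the third-party lxml library (pip install lxml; an assumption here, since it is not part of the standard library) implements full XPath 1.0, including count() and string-length():

```python
from lxml import etree  # third-party: pip install lxml

root = etree.fromstring(
    '<bookstore>'
    '<book><title lang="eng">Harry Potter</title></book>'
    '<book><title lang="zh">Learning XML</title></book>'
    '</bookstore>'
)

# count(node-set): the number of nodes selected by the expression
num_books = root.xpath('count(//book)')
print(num_books)  # 2.0

# string-length(string): the length of the string value of the argument
# (a node-set argument is converted to the string value of its first node)
title_len = root.xpath('string-length(//title[1])')
print(title_len)  # 12.0, the length of "Harry Potter"
```

Note that XPath 1.0 numeric results come back as Python floats, which is why count() returns 2.0 rather than 2.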

Below we use an example to illustrate how to use XPath to parse the page. Suppose we have the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="zh">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>

For this XML file, we can use the XPath syntax as shown below to get the nodes in the document.

Path expression       Result
/bookstore            Selects the root element bookstore. Note: a path beginning with a forward slash (/) always represents an absolute path to an element.
//book                Selects all book elements regardless of their position in the document.
//@lang               Selects all attributes named lang.
/bookstore/book[1]    Selects the first book child of bookstore.
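The XPath expressions above can be tried directly in Python. A minimal sketch using the standard-library xml.etree.ElementTree module, which supports a useful subset of XPath (attribute selection in the //@lang style is not part of that subset, so attribute predicates are shown instead):

```python
import xml.etree.ElementTree as ET

xml_doc = '''<bookstore>
    <book>
      <title lang="eng">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="zh">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>'''

root = ET.fromstring(xml_doc)  # root is the bookstore element

# .//book: all book elements anywhere below the current node
books = root.findall('.//book')
print(len(books))  # 2

# book[1]: the first book child (XPath positions start at 1)
first = root.find('book[1]')
print(first.find('title').text)  # Harry Potter

# .//title[@lang]: all title elements that carry a lang attribute
titles = root.findall('.//title[@lang]')
print([t.get('lang') for t in titles])  # ['eng', 'zh']
```

For full XPath support, including //@lang and functions, the third-party lxml library is the usual choice.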

CSS selector parsing

Locating elements through the attributes and relationships of HTML tags is known as CSS selector parsing. An element's position can be determined from the hierarchical structure, class names, ids, and other attributes of HTML tags. In Python, we can use the BeautifulSoup library to parse pages with CSS selectors.

Next we use an example to illustrate how to parse a page with CSS selectors. Suppose we have the following HTML code:

<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>This is the page title</title>
</head>
<body>
	<div class="content">
		<h2 id="main-heading">This is a heading</h2>
		<p>This is a paragraph of text</p>
	</div>
	<div class="footer">
		<p>Copyright © 2021</p>
	</div>
</body>
</html>

We can use CSS selector syntax as shown below to select page elements.

Selector        Result
div.content     Selects div elements with class content.
h2              Selects all h2 elements.
div.footer p    Selects all p elements under a div element with class footer.
[href]          Selects all elements that have an href attribute.
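A minimal sketch of applying these selectors with the third-party BeautifulSoup library (pip install beautifulsoup4), using a simplified version of the page above and the select method:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html_doc = '''
<div class="content">
    <h2 id="main-heading">This is a heading</h2>
    <p>This is a paragraph of text</p>
</div>
<div class="footer">
    <p>Copyright © 2021</p>
</div>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# div.content: div elements whose class is "content"
content_divs = soup.select('div.content')
print(len(content_divs))  # 1

# div.footer p: p elements inside a div with class "footer"
footer_p = soup.select('div.footer p')
print(footer_p[0].get_text())  # Copyright © 2021

# h2: all h2 elements
headings = soup.select('h2')
print(headings[0].get_text())  # This is a heading
```

select always returns a list of matching elements; select_one returns only the first match or None.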

Regular expression parsing

Regular expressions can be used to parse HTML pages, matching, finding, and replacing text. In Python, regular expression parsing is done with the re module.

Below we use an example to illustrate how to parse a page with regular expressions. Suppose we have the following HTML code:

<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>This is the page title</title>
</head>
<body>
	<div class="content">
		<h2 id="main-heading">This is a heading</h2>
		<p>This is a paragraph of text</p>
	</div>
	<div class="footer">
		<p>Copyright © 2021</p>
	</div>
</body>
</html>

We can use a regular expression as shown below to select page elements.

import re

html = '''
<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>This is the page title</title>
</head>
<body>
	<div class="content">
		<h2 id="main-heading">This is a heading</h2>
		<p>This is a paragraph of text</p>
	</div>
	<div class="footer">
		<p>Copyright © 2021</p>
	</div>
</body>
</html>
'''
# Non-greedy groups capture the h2 text and the first p text after it
pattern = re.compile(r'<h2.*?>(.*?)</h2>.*?<p>(.*?)</p>', re.S)
match = re.search(pattern, html)
if match:
    title = match.group(1)
    text = match.group(2)
    print(title)
    print(text)

In the code above, we use the compile method of the re module to compile the regular expression, then use the search method to match it against the HTML code. In the regular expression, ".*?" denotes non-greedy matching, i.e. matching stops at the first tag that satisfies the condition, and "re.S" lets "." match any character including newlines. Finally, we use the group method to retrieve the matched results.
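The difference between greedy and non-greedy matching can be seen in a small sketch:

```python
import re

s = '<p>first</p><p>second</p>'

# Greedy: .* runs to the last </p> in the string
print(re.findall(r'<p>(.*)</p>', s))   # ['first</p><p>second']

# Non-greedy: .*? stops at the first </p> it can match
print(re.findall(r'<p>(.*?)</p>', s))  # ['first', 'second']
```

This is why ".*?" is used when extracting individual tags: a greedy ".*" would swallow everything up to the last matching closing tag.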

The above is the detailed content of How to parse HTML pages with Python crawler. For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced from 亿速云. If there is any infringement, please contact admin@php.cn to have it deleted.