Python XPath Syntax

coldplay.xixi

Nov 26, 2020 pm 05:11 PM

This article from the Python video tutorial column introduces XPath syntax and how to use it in Python.

1. Introduction to XML

(1) What is XML

XML stands for Extensible Markup Language.
XML is a markup language, very similar to HTML.
XML is designed to transmit data, not to display it.
XML tags are not predefined; you define your own.
XML is designed to be self-describing.
XML is a W3C Recommendation.

W3School official document: http://www.w3school.com.cn/xml/index.asp

(2) The difference between XML and HTML

Both are used to describe and structure data. They look roughly the same in structure, but they differ clearly in purpose.

Data Format	Description	Design Goal
XML	Extensible Markup Language	Designed to transmit and store data; its focus is the content of the data.
HTML	HyperText Markup Language	Designed to display data; its focus is how the data looks.
HTML DOM	Document Object Model for HTML	Gives access to all HTML elements, along with the text and attributes they contain. Content can be modified and deleted, and new elements can be created.

(3) XML node relationship

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
	<book>
		<title>Harry Potter</title>
		<author>J K. Rowling</author>
		<year>2005</year>
		<price>29.00</price>
	</book>
</bookstore>

1. Parent

Each element and attribute has one parent. In the XML example above, the book element is the parent of the title, author, year and price elements.

2. Children

An element node can have zero, one or more children. In the example above, the title, author, year and price elements are all children of the book element.

3. Siblings

Siblings are nodes that have the same parent. In the example above, the title, author, year and price elements are all siblings.

4. Ancestors

A node's ancestors are its parent, the parent's parent, and so on. In the example above, the ancestors of the title element are the book element and the bookstore element.

5. Descendants

A node's descendants are its children, the children's children, and so on. In the example above, the descendants of bookstore are the book, title, author, year and price elements.
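
The five relationships can be checked from Python with lxml, the library used later in this article. This is only a minimal sketch: it assumes the book element from the example sits inside a bookstore root (as the ancestor and descendant descriptions imply), and the variable names are chosen for illustration.

```python
from lxml import etree

# Assumed sample: the book element from the text, wrapped in a bookstore root.
xml = (
    "<bookstore>"
    "<book>"
    "<title>Harry Potter</title>"
    "<author>J K. Rowling</author>"
    "<year>2005</year>"
    "<price>29.00</price>"
    "</book>"
    "</bookstore>"
)
root = etree.fromstring(xml)          # the <bookstore> element
title = root.find("book/title")

parent = title.getparent().tag                          # parent of title
children = [child.tag for child in root.find("book")]   # children of book
siblings = [el.tag for el in title.itersiblings()]      # siblings that follow title
ancestors = [el.tag for el in title.iterancestors()]    # nearest ancestor first
descendants = [el.tag for el in root.iterdescendants()] # document order

print(parent)
print(children)
print(siblings)
print(ancestors)
print(descendants)
```

Note that itersiblings() only walks the siblings after the node by default; title is the first child here, so that covers all of its siblings.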

2. XPath

XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document.

(1) Select nodes

XPath uses path expressions to select nodes or node sets in XML documents. These path expressions are very similar to those we see in regular computer file systems. The most commonly used path expressions are listed below:

Expression	Description
nodename	Selects all child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes among all descendants of the starting node (the document root when the path begins with //), regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.

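The table below lists some path expressions and their results:

Path expression	Description
bookstore	Selects all child elements of the current node that are named bookstore.
/bookstore	Selects the root element bookstore. A path that starts with / is an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, regardless of where they are in the document.
bookstore//book	Selects all book elements that are descendants of the bookstore element, regardless of where they sit below bookstore.
//@lang	Selects all attributes named lang.
text()	Selects the text content of an element.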
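
To see these path expressions in action, here is a small sketch using lxml's xpath() method. The sample document and variable names are assumptions made for illustration; they are not taken from the tables above.

```python
from lxml import etree

# Assumed sample data, not taken from the article.
xml = (
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title><price>29.99</price></book>"
    "<book><title lang='eng'>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)
root = etree.fromstring(xml)  # root is the <bookstore> element

absolute = root.xpath("/bookstore/book")  # absolute path from the document root
relative = root.xpath("book")             # children of the context node named book
anywhere = root.xpath("//title")          # every title, at any depth
langs = root.xpath("//@lang")             # attribute values, returned as strings
texts = root.xpath("//title/text()")      # text content of each title

first_title = anywhere[0]
current_tag = first_title.xpath(".")[0].tag   # "." is the context node itself
parent_tag = first_title.xpath("..")[0].tag   # ".." is its parent

print(len(absolute), len(relative), len(anywhere))
print(langs)
print(texts)
print(current_tag, parent_tag)
```

A relative expression such as book is evaluated against the element that xpath() is called on, which is why it finds the two book children of root here.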

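(2) Predicates

Predicates are used to find a specific node or a node that contains a specific value. They are written in square brackets. The table below lists some path expressions with predicates and their results:

Path expression	Description
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()	Selects the first book element that is a child of the bookstore element.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements that have a lang attribute with the value eng.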
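
The following sketch runs a few predicates with lxml. The four-book document is an assumption for illustration, and the position()<3 query is simply one example of a position() comparison.

```python
from lxml import etree

# Assumed sample data: four books, three of which have a lang attribute.
xml = (
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title></book>"
    "<book><title lang='eng'>Learning XML</title></book>"
    "<book><title lang='zh'>Journey to the West</title></book>"
    "<book><title>Water Margin</title></book>"
    "</bookstore>"
)
root = etree.fromstring(xml)

queries = {
    "first": "/bookstore/book[1]/title/text()",
    "last": "/bookstore/book[last()]/title/text()",
    "second_to_last": "/bookstore/book[last()-1]/title/text()",
    "first_two": "/bookstore/book[position()<3]/title/text()",
    "has_lang": "//title[@lang]/text()",
    "lang_is_eng": "//title[@lang='eng']/text()",
}
results = {name: root.xpath(expr) for name, expr in queries.items()}

for name, value in results.items():
    print(name, value)
```

XPath positions start at 1, not 0, so book[1] is the first book.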

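(3) Selecting unknown nodes

XPath wildcards can be used to select unknown XML elements.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.

The table below lists some path expressions and their results:

Path expression	Description
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.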
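
A short sketch of the wildcards with lxml follows. The document, including the magazine element, is an assumption made only to give the wildcards something to match.

```python
from lxml import etree

# Assumed sample data with two different kinds of child element.
xml = (
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title><price>29.99</price></book>"
    "<magazine><title>XML Weekly</title></magazine>"
    "</bookstore>"
)
root = etree.fromstring(xml)

child_tags = [el.tag for el in root.xpath("/bookstore/*")]  # direct children only
all_tags = [el.tag for el in root.xpath("//*")]             # every element
with_attrs = root.xpath("//title[@*]/text()")               # titles with any attribute

print(child_tags)
print(all_tags)
print(with_attrs)
```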

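(4) Selecting several paths

By using the "|" operator in a path expression, you can select several paths. The table below lists some path expressions and their results:

Path expression	Description
//book/title | //book/price	Selects all title and price elements of all book elements.
//title | //price	Selects all title and price elements in the document.
//price	Selects all price elements in the document.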
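
As a quick illustration of the "|" operator, the sketch below combines two paths in a single lxml query. The sample data is assumed.

```python
from lxml import etree

# Assumed sample data.
xml = (
    "<bookstore>"
    "<book><title>Harry Potter</title><price>29.99</price></book>"
    "<book><title>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)
root = etree.fromstring(xml)

# The union is returned in document order, so titles and prices interleave.
tags = [el.tag for el in root.xpath("//book/title | //book/price")]
texts = root.xpath("//title/text() | //price/text()")

print(tags)
print(texts)
```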

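3. The lxml module

(1) Introduction to lxml and installation

lxml is an HTML/XML parser whose main job is to parse HTML/XML data and extract information from it. With the XPath syntax covered above, we can quickly locate specific elements and node information.
Installation: pip install lxml

(2) Getting started with lxml

1. Parsing an HTML string

XML material: http://www.cnblogs.com/zhangboblogs/p/10114698.html
Summary: lxml can automatically repair HTML code. In the example, it not only completes the li tag but also adds the body and html tags.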
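
The linked material is not reproduced here, so the following is a stand-in sketch of the same idea rather than the original example: an assumed fragment with an unclosed li tag and no html or body wrapper is passed to etree.HTML(), and the repaired markup is serialized back to a string.

```python
from lxml import etree

# Assumed input: a fragment with an unclosed <li> and no html/body wrapper.
broken = "<ul><li>first item</li><li>second item</ul>"

root = etree.HTML(broken)                          # parse with the lenient HTML parser
fixed = etree.tostring(root, encoding="unicode")   # serialize back to a str
items = root.xpath("//li/text()")

print(root.tag)
print(items)
print(fixed)
```

The printed markup shows the closing li tag and the html and body wrappers that the parser added.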

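2. Reading a file with lxml

XML material: http://www.cnblogs.com/zhangboblogs/p/10114698.htm
Besides reading a string directly, lxml can also read content from a file. We create a hello.html file and read it with the etree.parse() method.
Note: when reading from a file with the default parser, the file content must be well-formed XML; if a tag is missing, it cannot be read correctly.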
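
A minimal sketch of reading from a file is shown below. The file content is an assumption (it deliberately leaves a li tag unclosed), and the file is written to a temporary directory so the example is self-contained. It also shows one way around the well-formedness requirement: passing etree.HTMLParser() to etree.parse().

```python
import os
import tempfile

from lxml import etree

# Assumed file content: the first <li> is never closed, so this is not well-formed XML.
page = "<html><body><ul><li>first<li>second</ul></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(page)

    # The default XML parser rejects the mismatched tags.
    try:
        etree.parse(path)
        strict_ok = True
    except etree.XMLSyntaxError:
        strict_ok = False

    # An HTML parser tolerates and repairs them.
    tree = etree.parse(path, etree.HTMLParser())
    items = tree.xpath("//li/text()")

print(strict_ok)
print(items)
```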

4. Parsing node information with XPath:

# Install lxml: pip install lxml

# 1. Import etree: two ways to import
# Option 1: import directly
from lxml import etree
# Note: some IDEs flag this import (a red squiggly line under etree); it does not affect normal use

# Option 2:
# from lxml import html
# etree = html.etree

# Named xml_str rather than str so the built-in str type is not shadowed
xml_str = (
    '<bookstore>'
        '<book>'
            '<title>Harry Potter</title>'
            '<price>29.99</price>'
        '</book>'
        '<book>'
            '<title>Learning XML</title>'
            '<price>39.95</price>'
        '</book>'
        '<book>'
            '<title>西游记</title>'
            '<price>69.95</price>'
        '</book>'
        '<book>'
            '<title>水浒传</title>'
            '<price>29.95</price>'
        '</book>'
        '<book>'
            '<title>三国演义</title>'
            '<price>29.95</price>'
        '</book>'
    '</bookstore>'
)


# 2. etree.HTML() converts the string into an HTML element object and automatically adds missing elements
html = etree.HTML(xml_str)  # an Element object
# print(html)


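# 3. Methods:
# 3.1 tostring()  shows the converted content (as bytes)
# To view it as a string, decode it
# To display Chinese characters, pass encoding='utf-8' first, then decode
# content = etree.tostring(html,encoding='utf-8')
# print(content.decode())


# 3.2 xpath()  extracts page data; the return value is a list
# xpath() must be called on the object returned by etree.HTML()

# How does xpath extract page data?
# Answer: with path expressions

# 3.2.1 There are two kinds of xpath paths:
# First kind: /  searches level by level; a leading / means the root path
# bookstore = html.xpath('/html/body/bookstore')
# print(bookstore)  # [<Element bookstore at 0x...>]

# Second kind: // any path; the focus is on the element itself
# Example: find the bookstore tag
# bookstore = html.xpath('//bookstore')
# print(bookstore)  # [<Element bookstore at 0x...>]

# Combining the first and second kinds
# Example: find all book tags
# book = html.xpath('//bookstore/book')
# print(book)  # a list of five <Element book at 0x...> objects

# 3.2.2 /text()  gets the content between the tags
# Example: get the content of all title tags
# Steps:
# 1. Find all title tags
# 2. Get the content
# title = html.xpath('//book/title/text()')
# print(title)  # ['Harry Potter', 'Learning XML', '西游记', '水浒传', '三国演义']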

# 3.3 Predicates: written in [], they can be read as conditions
# 3.3.1 [n] selects the n-th element; n is a number
# title = html.xpath('//book[position()>3]/title/text()')
# print(title)  # ['水浒传', '三国演义']
# title = html.xpath('//book[position()>last()-2]/title/text()')
# print(title)  # ['水浒传', '三国演义']

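# 3.3.3 Getting attribute values: @attribute_name
# (the examples below assume title tags that carry lang, src and class attributes)

# Example: get the content of the title tag whose lang attribute is cng
# title = html.xpath('//book/title[@lang="cng"]/text()')
# print(title)  # ['西游记']

# Example: get the content of the title tags that have a src attribute
# title = html.xpath('//book/title[@src]/text()')
# print(title)  # ['Harry Potter', '水浒传', '三国演义']

# Example: get the content of the title tags that have any attribute
# title = html.xpath('//book/title[@*]/text()')
# print(title)  # ['Harry Potter', 'Learning XML', '西游记', '水浒传', '三国演义']

# Example: get the value of the src attribute of the last title tag
# title = html.xpath('//book[last()]/title/@src')
# print(title)  # ['https://www.jd.com']

# Example: get the content of every tag that has a src attribute
# node = html.xpath('//*[@src]/text()')
# print(node)  # ['Harry Potter', '水浒传', '三国演义']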


# 3.4 and  joins predicates (conditions); all of them must hold
# Example: get the content of the title tag with lang="dng" and class="t1"
# title = html.xpath('//book/title[@lang="dng" and @class="t1"]/text()')
# title1 = html.xpath('//book/title[@lang="dng"][@class="t1"]/text()')
# print(title)  # ['三国演义']
# print(title1)  # ['三国演义']


# 3.5 or  joins predicates; at least one of them must hold
# Example: find the content of the title tags with lang="cng" or lang="bng"
# title = html.xpath('//book/title[@lang="cng" or @lang="bng"]/text()')
# print(title)  # ['Harry Potter', '西游记']


# 3.6 |  joins paths
# Example: get the content of all title tags and price tags
# title = html.xpath('//title/text() | //price/text()')
# print(title)  # ['Harry Potter', '29.99', 'Learning XML', '39.95', '西游记', '69.95', '水浒传', '29.95', '三国演义', '29.95']


# 3.8 parse()  reads data from a file
# Note: the file must be well-formed XML (no unclosed tags; every tag must be paired)
content = etree.parse('test.html')
# print(content)  # <lxml.etree._ElementTree object at 0x...>
res = etree.tostring(content, encoding='utf-8')
print(res.decode())
Output:

    <title>test</title>

    <h1>
        This is an html
    </h1>
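
The attribute queries in the listing above (sections 3.3.3 to 3.5) need title tags that carry attributes. The sketch below runs them end to end on a small hypothetical document: the attribute names lang, src and class follow the queries, but the values, the example.com URLs and their placement are assumptions made for this illustration. It parses with etree.fromstring() and also shows the encoding step from section 3.1.

```python
from lxml import etree

# Hypothetical sample: attribute values and placement are assumed for this sketch.
doc = (
    "<bookstore>"
    "<book><title lang='bng' src='https://example.com/a'>Harry Potter</title></book>"
    "<book><title lang='cng'>西游记</title></book>"
    "<book><title lang='dng' class='t1' src='https://example.com/b'>Learning XML</title></book>"
    "</bookstore>"
)
root = etree.fromstring(doc)

by_value = root.xpath('//book/title[@lang="cng"]/text()')   # attribute equals a value
has_src = root.xpath('//book/title[@src]/text()')           # attribute is present
last_src = root.xpath('//book[last()]/title/@src')          # read an attribute value
both = root.xpath('//book/title[@lang="dng" and @class="t1"]/text()')
chained = root.xpath('//book/title[@lang="dng"][@class="t1"]/text()')
either = root.xpath('//book/title[@lang="cng" or @lang="bng"]/text()')

# tostring() returns bytes; by default non-ASCII text is written as character references.
escaped = etree.tostring(root)
# With encoding='utf-8' the characters are kept, and decode() turns the bytes into a str.
readable = etree.tostring(root, encoding="utf-8").decode("utf-8")

print(by_value, has_src, last_src)
print(both, chained, either)
print(readable)
```

Chaining two predicates, as in the chained query, filters the same way as joining them with and in this case.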

Statement

This article is reproduced from CSDN. If there is any infringement, please contact admin@php.cn to have it removed.
