Comprehensive understanding of the Python crawler lxml parsing library
This article introduces the lxml parsing library for Python crawlers, together with the XPath syntax used to query documents with it. It is intended as a reference.
1. XPath
XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. Both XQuery and XPointer are built on XPath expressions.
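As a minimal sketch of what this looks like in practice, the snippet below uses lxml (the library covered later in this article) to evaluate two XPath expressions against a small XML document. The `bookstore` sample data is made up for illustration and is not part of the original article.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
"""

root = etree.fromstring(xml)

# Traverse elements: the text of every <title> element.
titles = root.xpath("//title/text()")

# Traverse attributes: the value of every lang attribute.
langs = root.xpath("//@lang")

print(titles)  # ['Harry Potter', 'Learning XML']
print(langs)   # ['en', 'en']
```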
2. Node
Nodes are related to one another as parent, children, sibling, ancestor, and descendant.
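These relationships map onto XPath axes such as `parent::`, `child::`, `following-sibling::`, `ancestor::`, and `descendant::`. The sketch below, which assumes a made-up one-book document, shows how each relationship could be queried with lxml.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
root = etree.fromstring(
    "<bookstore>"
    "<book><title>Harry Potter</title><author>J K. Rowling</author>"
    "<price>29.99</price></book>"
    "</bookstore>"
)

title = root.xpath("//title")[0]

# Parent: the element that directly contains <title>.
parent = title.xpath("parent::*")[0].tag

# Siblings: elements after <title> that share its parent.
siblings = [e.tag for e in title.xpath("following-sibling::*")]

# Ancestors: the parent, the parent's parent, and so on.
ancestors = [e.tag for e in title.xpath("ancestor::*")]

# Children: elements directly inside <book>.
children = [e.tag for e in root.xpath("/bookstore/book/child::*")]

# Descendants: every element below <bookstore>, at any depth.
descendants = [e.tag for e in root.xpath("descendant::*")]

print(parent)       # book
print(siblings)     # ['author', 'price']
print(children)     # ['title', 'author', 'price']
print(descendants)  # ['book', 'title', 'author', 'price']
```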
3. Select node
Path expression
Expression | Description |
nodename | Select all child elements of the current node that are named nodename |
/ | Select from the root node |
// | Select matching nodes in the document from the current node onward, regardless of their position |
. | Select the current node |
.. | Select the parent node of the current node |
@ | Select attributes |

Path expression | Result |
bookstore | Select all bookstore elements that are children of the current node |
/bookstore | Select the root element bookstore; a path that starts with / is an absolute path |
//book | Select all book elements, regardless of their position in the document |
bookstore//book | Select all book elements that are descendants of the bookstore element |
//@lang | Select all attributes named lang |
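The path expressions can be tried out with lxml. The following is a sketch under assumptions: the sample document (including the `section` element) is invented for illustration, and relative expressions such as `book` are evaluated from the element that `xpath()` is called on.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <section>
    <book>
      <title lang="zh">Learning XML</title>
      <price>39.95</price>
    </book>
  </section>
</bookstore>
"""
root = etree.fromstring(xml)  # root is the <bookstore> element

# /  : absolute path, starting at the document root.
top = root.xpath("/bookstore")

# nodename : child elements of the context node with that name.
child_books = root.xpath("book")  # only the direct child of <bookstore>

# // : match anywhere in the document, regardless of depth.
all_books = root.xpath("//book")

# . and .. : the current node and its parent.
nested = root.xpath("//section/book")[0]
same = nested.xpath(".")[0]
parent = nested.xpath("..")[0]

# @ : attributes.
langs = root.xpath("//@lang")

print(top[0].tag, len(child_books), len(all_books), parent.tag, langs)
# bookstore 1 2 section ['en', 'zh']
```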
Predicate
The predicate is used to find a specific node or a node containing a specified value. It is written inside square brackets.
Path expression | Result |
/bookstore/book[1] | Select the first book element that is a child of the bookstore element |
/bookstore/book[last()] | Select the last book element that is a child of the bookstore element |
/bookstore/book[last()-1] | Select the penultimate book element that is a child of the bookstore element |
/bookstore/book[price>35.00] | Select all book elements of the bookstore element whose price element has a value greater than 35.00 |
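A short sketch of predicates in action with lxml follows. The four-book sample document and the `titles` helper are assumptions made for this example. Note that XPath positions start at 1, not 0.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """
<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""
root = etree.fromstring(xml)


def titles(expr):
    """Return the title text of every book matched by the XPath expression."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")
last = titles("/bookstore/book[last()]")
penultimate = titles("/bookstore/book[last()-1]")
expensive = titles("/bookstore/book[price>35.00]")

print(first)        # ['Everyday Italian']
print(last)         # ['Learning XML']
print(penultimate)  # ['XQuery Kick Start']
print(expensive)    # ['XQuery Kick Start', 'Learning XML']
```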
* Match any element node
@* Match any attribute node
node() Match any type of node
4. lxml usage
```python
#!/usr/bin/env python3
from lxml import etree

text = '''
<p>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</p>
'''

# etree.HTML() parses the string into an element tree and automatically
# repairs broken markup (the last <li> above is missing its closing tag).
html = etree.HTML(text)
# To read the same markup from a file instead, pass an HTML parser:
# html = etree.parse('hello.html', etree.HTMLParser())
# To serialize the tree back to a string:
# result = etree.tostring(html, pretty_print=True)
# print(result.decode())

print(type(html))
result = html.xpath('//li')
print(result)
print(len(result))
print(type(result))
print(type(result[0]))
print(html.xpath('//li/@class'))  # class attribute of every <li>
print(html.xpath('//li/a[@href="link1.html"]'))  # <a> under <li> whose href is link1.html
print(html.xpath('//li//span'))  # every <span> below an <li> (none in this document)
print(html.xpath('//li[last()-1]/a')[0].text)  # link text of the second-to-last <li>
```
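The wildcards listed before the lxml example (`*`, `@*`, and `node()`) can be tried in the same way. This is a minimal sketch on a made-up document, not code from the original article.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
root = etree.fromstring(
    '<bookstore>'
    '<book id="b1" lang="en">intro<title>Harry Potter</title></book>'
    '</bookstore>'
)

# * matches any element node.
all_tags = [e.tag for e in root.xpath("//*")]
child_tags = [e.tag for e in root.xpath("/bookstore/book/*")]

# @* matches any attribute node, whatever its name.
attr_values = sorted(root.xpath("//book/@*"))

# node() matches nodes of any type; here a text node and an element.
nodes = root.xpath("//book/node()")

print(all_tags)     # ['bookstore', 'book', 'title']
print(child_tags)   # ['title']
print(attr_values)  # ['b1', 'en']
print(len(nodes))   # 2
```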