
Comprehensive understanding of the lxml parsing library for Python crawlers

Author: 黄舟 (original)
Published: 2017-08-08 11:33:08

This article introduces lxml, a parsing library commonly used in Python crawlers, together with the XPath basics needed to use it. It is offered here as a reference.

1. XPath

XPath is a language for locating information in XML documents. It can be used to traverse the elements and attributes of a document. Both XQuery and XPointer are built on XPath expressions.

2. Nodes

The nodes of a document are related to one another as parent, children, siblings, ancestors, and descendants.
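The relationships listed above can be explored directly with lxml. The following is a minimal sketch, assuming a small hypothetical bookstore document that is not part of the original article; the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = etree.fromstring(xml)
title = root.xpath("//title")[0]

# Parent: the element that directly contains <title>.
parent = title.getparent()
print(parent.tag)

# Children: the elements directly inside <book>.
print([child.tag for child in parent])

# Siblings: elements that share the same parent (here, those after <title>).
print([el.tag for el in title.xpath("following-sibling::*")])

# Ancestors: the parent, the parent's parent, and so on up to the root.
print([el.tag for el in title.iterancestors()])

# Descendants: children, children's children, and so on.
print([el.tag for el in root.iterdescendants()])
```

In this sketch, book is the parent of title, author and price are its siblings, and bookstore is an ancestor of all of them.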

3. Selecting nodes

Path expressions

Expression   Description
nodename     Selects the child elements of the current node that have this name
/            Selects from the root node
//           Selects matching nodes at any depth, regardless of their position
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes

Path expression    Result
bookstore          Selects the bookstore child elements of the current node
/bookstore         Selects the root element bookstore; a path that starts with / is an absolute path
//book             Selects all book elements, regardless of their position in the document
bookstore//book    Selects all book elements that are descendants of a bookstore element
//@lang            Selects all attributes named lang
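The path expressions above can be tried out with lxml's xpath() method. This is a minimal sketch, assuming a hypothetical two-book bookstore document; the document and the variable names are illustrative and do not come from the original article.

```python
from lxml import etree

# Hypothetical bookstore document, used only for illustration.
xml = """
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = etree.fromstring(xml)   # root is the <bookstore> element
tree = root.getroottree()      # document-level object, handy for absolute paths

# nodename: child elements of the current node with that name.
books = root.xpath("book")
print(len(books))

# /: an absolute path that starts at the document root.
print(tree.xpath("/bookstore")[0].tag)

# //: matching elements at any depth.
print([t.text for t in root.xpath("//title")])

# . and ..: the current node and its parent.
first_book = books[0]
print(first_book.xpath(".")[0].tag)
print(first_book.xpath("..")[0].tag)

# @: attribute values are returned as strings.
print(root.xpath("//@lang"))
```

Note that xpath() returns a list in each of these cases: a list of elements for element paths and a list of strings for attribute paths.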

Predicates

A predicate is used to find a specific node, or a node that contains a specified value. Predicates are written inside square brackets.

Path expression                  Result
/bookstore/book[1]               Selects the first book element that is a child of bookstore
/bookstore/book[last()]          Selects the last book element that is a child of bookstore
/bookstore/book[last()-1]        Selects the second-to-last book element that is a child of bookstore
/bookstore/book[position()<3]    Selects the first two book elements that are children of bookstore
/bookstore/book[price>35.0]      Selects every book element under bookstore whose price element has a value greater than 35.0

Selecting unknown nodes (wildcards)

Wildcard   Description
*          Matches any element node
@*         Matches any attribute node
node()     Matches a node of any type
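Predicates and wildcards are easiest to understand by running them. The sketch below assumes a hypothetical three-book document and a small helper named titles(); both are introduced purely for illustration and are not from the original article.

```python
from lxml import etree

# Hypothetical bookstore document, used only for illustration.
xml = """
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
</bookstore>
"""
root = etree.fromstring(xml)


def titles(path):
    """Return the title text of every book element matched by path."""
    return [book.findtext("title") for book in root.xpath(path)]


# Predicates: XPath positions start at 1, not 0.
print(titles("/bookstore/book[1]"))
print(titles("/bookstore/book[last()]"))
print(titles("/bookstore/book[last()-1]"))
print(titles("/bookstore/book[price>35.0]"))

# Wildcards.
child_tags = [el.tag for el in root.xpath("/bookstore/*")]        # any element
title_attrs = root.xpath("//title/@*")                            # any attribute
title_nodes = root.xpath("/bookstore/book[1]/title/node()")       # any node type
print(child_tags, title_attrs, title_nodes)
```

In the last query, node() matches the text node inside the first title, which is why a string is returned rather than an element.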

4. Using lxml

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from lxml import etree

text = '''
<p>
 <ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
  <li class="item-0"><a href="link5.html">fifth item</a>
 </ul>
</p>
'''

# etree.HTML() builds an element tree from the string and automatically
# repairs broken markup (here, the missing </li> on the last item).
html = etree.HTML(text)
# result = etree.tostring(html)  # serialize the tree back to a string

# Alternative: parse the same markup from a file instead of a string.
# html = etree.parse('hello.html', etree.HTMLParser())
# result = etree.tostring(html, pretty_print=True)
# print(result)
print(type(html))
result = html.xpath('//li')
print(result)
print(len(result))
print(type(result))
print(type(result[0]))
print(html.xpath('//li/@class'))  # class attribute of every <li>
print(html.xpath('//li/a[@href="link1.html"]'))  # <a> under <li> whose href is link1.html
print(html.xpath('//li//span'))  # every <span> under an <li> (none in this sample)
print(html.xpath('//li[last()-1]/a')[0].text)  # text of the <a> in the second-to-last <li>
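The automatic repair performed by etree.HTML() can be made visible by serializing the tree with etree.tostring(). The following is a minimal sketch, assuming a deliberately broken fragment (the variable name broken and the fragment itself are illustrative only).

```python
from lxml import etree

# Deliberately broken fragment: the second <li> is never closed.
broken = (
    '<ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a>'
    '</ul>'
)

# The HTML parser is forgiving: it closes the open <li> and wraps the
# fragment in <html> and <body> elements.
html = etree.HTML(broken)

# encoding="unicode" makes tostring() return str instead of bytes.
result = etree.tostring(html, pretty_print=True, encoding="unicode")
print(result)

# The repaired tree can be queried like any other.
classes = html.xpath("//li/@class")
second = html.xpath('//li/a[@href="link2.html"]/text()')
print(classes, second)
```

The serialized output contains a closing tag for both li elements, even though the input only closed the first one.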
