Crawler Parsing Methods, Part 5: XPath (Python Tutorial, PHP Chinese Network)

爱喝马黛茶的安东尼

Jun 05, 2019 pm 03:36 PM

Tags: python, xpath, crawler

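Many languages can be used to write crawlers, but Python-based crawlers are more concise and convenient, and crawling has become an indispensable part of the Python ecosystem. There are also many ways to parse what a crawler fetches. The previous article covered parsing method four, PyQuery; this one covers another approach, XPath.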

Basic usage of XPath in Python crawlers

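I. Introduction

　　XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

II. Installation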

pip3 install lxml

III. Usage

　　1. Import

from lxml import etree

　　2. Basic usage

from lxml import etree

wb_data = """
        <div>
            <ul>
                 <li class="item-0"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a>
             </ul>
         </div>
        """
html = etree.HTML(wb_data)
print(html)
result = etree.tostring(html)
print(result.decode("utf-8"))

　　As the output below shows, printing html only shows a Python object (an lxml Element), while etree.tostring(html) serializes the tree as complete HTML: the parser has filled in the missing pieces, here the unclosed </li> and the enclosing <html> and <body> tags.

<Element html at 0x39e58f0>
<html><body><div>
            <ul>
                 <li class="item-0"><a href="link1.html">first item</a></li>
                 <li class="item-1"><a href="link2.html">second item</a></li>
                 <li class="item-inactive"><a href="link3.html">third item</a></li>
                 <li class="item-1"><a href="link4.html">fourth item</a></li>
                 <li class="item-0"><a href="link5.html">fifth item</a>
             </li></ul>
         </div>
        </body></html>
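
The repair behaviour described above can be checked on a smaller input. The following is a minimal sketch, assuming lxml is installed; the fragment and the variable names are made up for illustration.

```python
from lxml import etree

# Deliberately broken fragment: neither <li> is closed,
# and there is no <html> or <body> wrapper.
broken = "<ul><li>one<li>two</ul>"

root = etree.HTML(broken)        # the HTML parser repairs the markup
items = root.xpath("//li")       # both list items are recovered

print(root.tag)                  # tag of the root element the parser created
print([li.text for li in items])
print(etree.tostring(root, encoding="unicode"))
```

Passing encoding="unicode" makes tostring return a str directly, so no decode("utf-8") step is needed.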

　　3. Getting the content of a tag (basic usage). Note: to select all a tags, do not add a trailing slash after a, otherwise lxml raises an error.

　　Approach 1

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a')
print(html)
for i in html_data:
    print(i.text)

<Element html at 0x12fe4b8>
first item
second item
third item
fourth item
fifth item

　　Approach 2 (just append /text() to the tag whose content you want)

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a/text()')
print(html)
for i in html_data:
    print(i)

<Element html at 0x138e4b8>
first item
second item
third item
fourth item
fifth item
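
The two approaches give the same result here only because each a tag contains plain text. As a hedged illustration (the markup below is invented, not taken from the article), this sketch shows how .text, /text() and string() differ once a tag has nested children:

```python
from lxml import etree

# Hypothetical markup: the <a> tag contains a nested <b> element.
root = etree.HTML('<ul><li><a href="link1.html">first <b>bold</b> item</a></li></ul>')
a = root.xpath("//li/a")[0]

direct = a.text                        # only the text before the first child element
nodes = root.xpath("//li/a/text()")    # every text node that is a direct child of <a>
full = root.xpath("string(//li/a)")    # all text inside <a>, descendants included

print(repr(direct))
print(nodes)
print(repr(full))
```

When the full visible text of an element is needed, string() (or the element's itertext() method) is usually the safer choice.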

　　4. Opening and reading an HTML file

# Use parse to open an HTML file. By default parse uses the XML parser,
# so test.html must be well-formed; pass etree.HTMLParser() for messy HTML.
html = etree.parse('test.html')
html_data = html.xpath('//*')
# The result is a list, so iterate over it
print(html_data)
for i in html_data:
    print(i.text)

html = etree.parse('test.html')
html_data = etree.tostring(html, pretty_print=True)
res = html_data.decode('utf-8')
print(res)

Output:

<div>
     <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
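
Real pages are rarely well-formed, and etree.parse uses the XML parser unless told otherwise. The sketch below shows one way to handle that, assuming an in-memory io.StringIO object stands in for test.html so that the example runs without a file on disk:

```python
import io
from lxml import etree

# Stand-in for a file on disk; the last <li> is left unclosed on purpose.
fake_file = io.StringIO(
    '<div><ul><li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></ul></div>'
)

parser = etree.HTMLParser()             # tolerant HTML parser instead of the XML default
tree = etree.parse(fake_file, parser)   # a file name such as 'test.html' also works here
texts = tree.xpath("//li/a/text()")

print(texts)
```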

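　　5. Printing an attribute of the a tags under a given path (iterating over the result gives each attribute value, which can then be used to find the tag's content)

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a/@href')
for i in html_data:
    print(i)

Output: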

link1.html
link2.html
link3.html
link4.html
link5.html
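
Attribute values are often wanted together with the text of the same tag. One possible way to pair them, sketched here with a well-formed copy of the sample markup and an illustrative dictionary named links, is to select the elements and read both pieces from each one:

```python
from lxml import etree

wb_data = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

root = etree.HTML(wb_data)

# Select the <a> elements once, then read the attribute and the text from each.
links = {a.get("href"): a.text for a in root.xpath("//li/a")}

for href, text in links.items():
    print(href, "->", text)
```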

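　　6. xpath() returns a list (of Element objects, or of strings for text() and attribute queries), so to get at the content you still need to iterate over the returned list.

　　Find the content of the a tag whose href attribute equals link2.html, using an absolute path.

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a[@href="link2.html"]/text()')
print(html_data)
for i in html_data:
    print(i)

Output: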

['second item']

second item
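
The same predicate syntax works for any attribute. As a small sketch on a copy of the sample markup, the following filters li tags by their class attribute, once with an exact match and once with the standard XPath contains() function:

```python
from lxml import etree

wb_data = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

root = etree.HTML(wb_data)

# Exact match on the class attribute of the parent <li>.
item1_texts = root.xpath('//li[@class="item-1"]/a/text()')

# Substring match: any <li> whose class contains "inactive".
inactive_links = root.xpath('//li[contains(@class, "inactive")]/a/@href')

print(item1_texts)
print(inactive_links)
```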

　　7. Everything above used absolute paths (each one starts from the root). Next come relative paths, which start with // and match at any depth, for example to get the content of the a tags under all li tags.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a/text()')
print(html_data)
for i in html_data:
    print(i)

Output:

['first item', 'second item', 'third item', 'fourth item', 'fifth item']
first item
second item
third item
fourth item
fifth item
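
One detail is easy to miss when xpath() is called on an element rather than on the whole document: a path starting with // still searches the entire document, while .// is limited to the element's own subtree. A minimal sketch, using invented markup with two lists:

```python
from lxml import etree

# Hypothetical markup with two separate lists.
root = etree.HTML(
    '<div>'
    '<ul id="first"><li><a>a1</a></li></ul>'
    '<ul id="second"><li><a>b1</a></li><li><a>b2</a></li></ul>'
    '</div>'
)

second = root.xpath('//ul[@id="second"]')[0]

whole_doc = second.xpath("//a/text()")    # "//" starts at the document root
subtree = second.xpath(".//a/text()")     # ".//" stays inside the second <ul>

print(whole_doc)
print(subtree)
```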

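　　8. Above, an absolute path (starting with /) was used to get the href attribute values of all a tags. Below, a relative path is used to get the href values of the a tags under li tags. Note that the double slash after a is not required: a//@href also collects href attributes from anything nested inside a, and //li/a/@href returns the same list for this document.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a//@href')
print(html_data)
for i in html_data:
    print(i)

Output:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']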
link1.html
link2.html
link3.html
link4.html
link5.html
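
The difference between a/@href and a//@href only becomes visible when something nested inside a carries an href of its own. The sketch below uses a contrived XML fragment (the href on the b tag is purely hypothetical) to make that difference visible:

```python
from lxml import etree

# Contrived XML: the nested <b> carries its own href attribute.
root = etree.fromstring(
    '<ul><li><a href="outer.html"><b href="inner.html">x</b></a></li></ul>'
)

single = root.xpath("//li/a/@href")    # href of the <a> tags themselves
double = root.xpath("//li/a//@href")   # href of <a> and of everything nested in it

print(single)
print(double)
```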

　　9. Looking up a specific attribute value with a relative path works the same way as with an absolute path.

html = etree.HTML(wb_data)
html_data = html.xpath('//li/a[@href="link2.html"]')
print(html_data)
for i in html_data:
    print(i.text)

Output:

[<Element a at 0x216e468>]
second item

　　10. Finding the text of the a tag in the last li tag

html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()]/a/text()')
print(html_data)
for i in html_data:
    print(i)

Output:

['fifth item']
fifth item

　　11. Finding the text of the a tag in the second-to-last li tag

html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()-1]/a/text()')
print(html_data)
for i in html_data:
    print(i)

Output:

['fourth item']
fourth item
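
Positions in XPath are counted from 1, and last() can be combined with other position tests. The following sketch, again on a copy of the sample markup, shows the first item, the first two items, and the href of the last item:

```python
from lxml import etree

wb_data = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

root = etree.HTML(wb_data)

first = root.xpath("//li[1]/a/text()")                  # positions start at 1, not 0
first_two = root.xpath("//li[position() < 3]/a/text()")
last_href = root.xpath("//li[last()]/a/@href")          # attribute instead of text

print(first, first_two, last_href)
```

Position predicates are evaluated among the siblings under each parent, so on a page with several lists //li[last()] returns the last li of every list.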

　　12. When you need the XPath of a particular tag on a page, the result looks like this:

　　//*[@id="kw"]

　　Explanation: a relative path that searches all tags for the one whose id attribute equals kw.
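
As a rough sketch of how such an expression is used with lxml, the snippet below runs it against a small made-up form; the markup, attribute values and variable names are assumptions for illustration, not taken from a real page.

```python
from lxml import etree

# Hypothetical page fragment containing an element with id="kw".
page = (
    '<form>'
    '<input id="kw" name="wd" class="s_ipt">'
    '<input id="su" type="submit" value="Search">'
    '</form>'
)

root = etree.HTML(page)
matches = root.xpath('//*[@id="kw"]')   # any tag, anywhere, whose id is "kw"

box = matches[0]
print(box.tag, dict(box.attrib))
```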

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# HtmlXPathSelector is a deprecated legacy class that recent Scrapy releases no longer provide
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li><a id='i1' href="link.html">first item</a></li>
            <li><a id='i2' href="llink.html">first item</a></li>
            <li><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# Uncomment any pair of lines below to try that query.
# hxs = HtmlXPathSelector(response)  # legacy API, use Selector instead
# print(hxs)
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or (note: this one looks one level deeper, at li/*/a/span)
#     # v = item.xpath('*/a/span')
#     print(v)
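
Several of the Scrapy queries above can also be tried with lxml alone. The sketch below is an lxml counterpart, not the Scrapy API: it assumes the same list markup, and because lxml does not register the re: prefix automatically, the EXSLT regular-expressions namespace is passed in explicitly.

```python
from lxml import etree

html = """<ul>
    <li><a id="i1" href="link.html">first item</a></li>
    <li><a id="i2" href="llink.html">first item</a></li>
    <li><a href="llink2.html">second item<span>vv</span></a></li>
</ul>"""

root = etree.HTML(html)

# lxml needs the EXSLT regular-expressions namespace spelled out for re:test().
RE_NS = {"re": "http://exslt.org/regular-expressions"}

with_id = root.xpath("//a[@id]/@href")                             # <a> tags that have an id
contains_link = root.xpath('//a[contains(@href, "link")]/@href')   # substring match
starts_link = root.xpath('//a[starts-with(@href, "link")]/@href')  # prefix match
regex_ids = root.xpath(r'//a[re:test(@id, "^i\d+$")]/@href', namespaces=RE_NS)

print(with_id, contains_link, starts_link, regex_ids)
```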

Statement

This article is reproduced from CSDN. In case of infringement, please contact admin@php.cn for removal.
