
How to parse HTML with lxml

高洛峰 (Original) · 2017-03-12 17:51

This article introduces how to parse HTML with lxml.

First, a code example that extracts the links from a page:

#coding=utf-8
from lxml import etree
html = '''
<html>
  <head>
    <meta name="content-type" content="text/html; charset=utf-8" />
    <title>友情链接查询 - 站长工具</title>
    <!-- uRj0Ak8VLEPhjWhg3m9z4EjXJwc -->
    <meta name="Keywords" content="友情链接查询" />
    <meta name="Description" content="友情链接查询" />
  </head>
  <body>
    <h1 class="heading">Top News</h1>
    <p style="font-size: 200%">World News only on this page</p>
    Ah, and here&#39;s some more text, by the way.
    <p>... and this is a parsed fragment ...</p>
    <a href="http://www.cydf.org.cn/" rel="nofollow" target="_blank">青少年发展基金会</a> 
    <a href="http://www.4399.com/flash/32979.htm" target="_blank">洛克王国</a> 
    <a href="http://www.4399.com/flash/35538.htm" target="_blank">奥拉星</a> 
    <a href="http://game.3533.com/game/" target="_blank">手机游戏</a>
    <a href="http://game.3533.com/tupian/" target="_blank">手机壁纸</a>
    <a href="http://www.4399.com/" target="_blank">4399小游戏</a> 
    <a href="http://www.91wan.com/" target="_blank">91wan游戏</a>
  </body>
</html>
'''
page = etree.HTML(html.lower())
hrefs = page.xpath(u"//a")
for href in hrefs:
    print(href.attrib)

The printed result is:

{'href': 'http://www.cydf.org.cn/', 'target': '_blank', 'rel': 'nofollow'}
{'href': 'http://www.4399.com/flash/32979.htm', 'target': '_blank'}
{'href': 'http://www.4399.com/flash/35538.htm', 'target': '_blank'}
{'href': 'http://game.3533.com/game/', 'target': '_blank'}
{'href': 'http://game.3533.com/tupian/', 'target': '_blank'}
{'href': 'http://www.4399.com/', 'target': '_blank'}
{'href': 'http://www.91wan.com/', 'target': '_blank'}

If you want to get the text between the <a> and </a> tags:

for href in hrefs:
    print(href.text)

The result is:

青少年发展基金会
洛克王国
奥拉星
手机游戏
手机壁纸
4399小游戏
91wan游戏

A note before using lxml on raw byte strings: make sure the html has been decoded from utf-8 first, i.e. code = html.decode('utf-8', 'ignore'); otherwise parsing errors will occur. The reason is that Chinese text left in its encoded form looks like '/u2541', and when lxml encounters the "/" it assumes the tag has ended.
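A minimal sketch of that advice, using a made-up utf-8 fragment in place of a real page:

```python
from lxml import etree

# Hypothetical raw bytes, e.g. the body of an HTTP response declared as utf-8.
raw = "<html><body><p>中文内容 and ASCII</p></body></html>".encode("utf-8")

# Decode to str before parsing; "ignore" drops any malformed bytes.
page = etree.HTML(raw.decode("utf-8", "ignore"))
print(page.xpath("//p")[0].text)
```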

XPath describes a location in an XML document much like a path in a directory tree, with "/" separating the levels. The first "/" represents the root node of the document (note: this is the document itself, not the outermost tag node). For an HTML file, the outermost element node is "/html".

To locate an HTML tag, you can use an absolute path, much like a file path: page.xpath(u"/html/body/p") finds the p tags that are children of the body node. You can also use a relative path: page.xpath(u"//p") finds every p tag anywhere in the html:

  <p style="font-size: 200%">World News only on this page</p>
  <p>... and this is a parsed fragment ...</p>

Note: an XPath query does not necessarily return a single node; it returns every node that satisfies the expression. With the relative form //p, every p in the document is matched, whether it sits one, two, or three levels below body; the absolute form /html/body/p only matches p elements that are direct children of body.
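A small sketch of the difference, on a made-up fragment with p elements at two depths:

```python
from lxml import etree

# Hypothetical fragment: one <p> directly under <body>, one nested in a <div>.
page = etree.HTML(
    "<html><body>"
    "<p>direct child</p>"
    "<div><p>nested deeper</p></div>"
    "</body></html>"
)

# The absolute path only matches direct children of <body>...
print([p.text for p in page.xpath("/html/body/p")])  # ['direct child']
# ...while the relative form matches <p> at any depth.
print([p.text for p in page.xpath("//p")])  # ['direct child', 'nested deeper']
```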

What if you want to narrow the scope further and directly locate "<p style="font-size: 200%">World News only on this page</p>"? You need to add filter conditions, written in square brackets "[ ]". lxml supports the standard XPath predicate syntax:

p = page.xpath(u"/html/body/p[@style='font-size: 200%']")

Or: p = page.xpath(u"//p[@style='font-size: 200%']")

This takes the p node in the body whose style is font-size: 200%. Note: the variable p is a list of lxml.etree._Element objects; p[0].text is "World News only on this page", i.e. the text between the tags, and p[0].values() is ['font-size: 200%'], i.e. all of the element's attribute values. Here @style refers to the attribute style; in the same way you can use @name, @id, @value, @href, @src, @class, and so on.
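Putting those pieces together on the relevant part of the earlier sample markup:

```python
from lxml import etree

page = etree.HTML(
    "<html><body>"
    '<p style="font-size: 200%">World News only on this page</p>'
    "<p>... and this is a parsed fragment ...</p>"
    "</body></html>"
)

p = page.xpath(u"//p[@style='font-size: 200%']")
print(p[0].text)      # the text between the tags
print(p[0].values())  # all attribute values, as a list
```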

What if the tag has no such attribute? Then you can filter with functions such as text() and position(). The function text() returns the text a node contains. For example, given <p>hello</p><p>world</p>, the expression "p[text()='hello']" selects the first p, and world is the text() of the second. The function position() returns a node's position: "li[position()=2]" selects the second li node, which can be abbreviated to "li[2]".

Do note, however, the order of numeric positions and filter conditions. "ul/li[5][@name='hello']" means: take the fifth li under ul, and return it only if its name is 'hello', otherwise return empty. "ul/li[@name='hello'][5]" means something different: find the li nodes under ul whose name is 'hello', and take the fifth of those.
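A sketch of how the two orderings differ, on a hypothetical ul (the name attributes are made up for the demonstration):

```python
from lxml import etree

# Six <li> items; all but the third carry name="hello".
page = etree.HTML(
    "<html><body><ul>"
    '<li name="hello">a</li>'
    '<li name="hello">b</li>'
    "<li>c</li>"
    '<li name="hello">d</li>'
    '<li name="hello">e</li>'
    '<li name="hello">f</li>'
    "</ul></body></html>"
)

# Fifth li overall, kept only if its name is "hello":
print([li.text for li in page.xpath("//ul/li[5][@name='hello']")])  # ['e']
# Fifth among the li elements whose name is "hello":
print([li.text for li in page.xpath("//ul/li[@name='hello'][5]")])  # ['f']
```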

In addition, "*" can stand in for any node name. For example, "/html/body/*/span" selects every span exactly two levels below body, no matter whether its parent is a div, a p, or anything else.
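A quick sketch of the wildcard, with parents of different tag names made up for the purpose:

```python
from lxml import etree

page = etree.HTML(
    "<html><body>"
    "<div><span>in a div</span></div>"
    "<p><span>in a p</span></p>"
    "<span>directly in body</span>"
    "</body></html>"
)

# "*" matches any element name: every <span> exactly two levels below
# <body> is returned, but not the <span> sitting directly in <body>.
print([s.text for s in page.xpath("/html/body/*/span")])
```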

The "descendant::" prefix denotes any number of intermediate levels, and it can be abbreviated to "//". For example, to find the p with id "leftmenu" anywhere in the HTML document, you can write "/descendant::p[@id='leftmenu']", or simply "//p[@id='leftmenu']".
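Both spellings can be checked on a made-up fragment with the target nested two divs deep:

```python
from lxml import etree

page = etree.HTML(
    "<html><body><div><div>"
    '<p id="leftmenu">menu</p>'
    "</div></div></body></html>"
)

# The explicit axis and its "//" shorthand find the same node.
print(page.xpath("/descendant::p[@id='leftmenu']")[0].text)  # menu
print(page.xpath("//p[@id='leftmenu']")[0].text)             # menu
```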

text = page.xpath(u"/descendant::*[text()]") selects, at any depth, every element that contains text; in effect, this implements spider-style extraction of page content. The following content, however, cannot be retrieved through the text attribute:

<p class="news">
  1. <b>无流量站点清理公告</b>  2013-02-22<br />
  取不到的内容
</p>
<p class="news">
  2. <strong>无流量站点清理公告</strong>  2013-02-22<br />
  取不到的内容
</p>
<p class="news">
  3. <span>无流量站点清理公告</span>  2013-02-22<br />
  取不到的内容
</p>
<p class="news">
  4. <u>无流量站点清理公告</u>  2013-02-22<br />
  取不到的内容
</p>

Those lines of "取不到的内容" ("content that can't be fetched") are not reachable through text. What to do? Don't worry: lxml elements also have an attribute called "tail", which holds the content that follows a node's end tag, i.e. the text between, say, "<br />" and "</p>". The source code describes it as "text after end tag".
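A sketch of text versus tail on a paragraph shaped like the ones above (with English placeholder strings for brevity):

```python
from lxml import etree

page = etree.HTML(
    '<html><body><p class="news">'
    "1. <b>Announcement</b> 2013-02-22<br/>"
    "trailing text"
    "</p></body></html>"
)

p = page.xpath("//p[@class='news']")[0]
b = p.xpath("b")[0]
br = p.xpath("br")[0]
print(repr(p.text))   # '1. '           - text before the first child element
print(repr(b.tail))   # ' 2013-02-22'   - text after </b>
print(repr(br.tail))  # 'trailing text' - text after <br/>
```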

As for the "following-sibling::" prefix, as its name suggests it selects the sibling nodes that follow at the same level. "following-sibling::*" matches any following sibling, and "following-sibling::ul" matches the following ul sibling.
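A short sketch of the axis on a made-up sequence of siblings:

```python
from lxml import etree

page = etree.HTML(
    "<html><body>"
    "<h1>Title</h1><p>first</p><ul><li>x</li></ul><p>second</p>"
    "</body></html>"
)

h1 = page.xpath("//h1")[0]
# All siblings that follow <h1> at the same level:
print([e.tag for e in h1.xpath("following-sibling::*")])   # ['p', 'ul', 'p']
# Only the following <ul> sibling:
print([e.tag for e in h1.xpath("following-sibling::ul")])  # ['ul']
```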

If the content between script and style tags interferes with parsing, or the page is very irregular, you can use the lxml.html.clean module. It provides a Cleaner class for cleaning up HTML pages, with support for removing embedded or script content, special tags, CSS style comments, and more.

  from lxml.html.clean import Cleaner

  cleaner = Cleaner(style=True, scripts=True, page_structure=False, safe_attrs_only=False)
  print(cleaner.clean_html(html))

Note: page_structure and safe_attrs_only must be False to keep the page intact; otherwise this Cleaner will strip out your html structure and the attributes inside the tags. Use the Cleaner class very carefully, or it will backfire on you.

 

To match an attribute value case-insensitively, use the XPath translate() function:

  page = etree.HTML(html)
  keyword_tag = page.xpath("//meta[translate(@name,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='keywords']")


