search
HomeWeb Front-endHTML TutorialHow to parse HTML with lxml

How to parse HTML with lxml

Mar 12, 2017 pm 05:51 PM

This article introduces the method of parsing HTML with lxml

First demonstrate a code example for obtaining a page link:

#coding=utf-8
from lxml import etree
html = '''
<html>
  <head>
    <meta name="content-type" content="text/html; charset=utf-8" />
    <title>友情链接查询 - 站长工具</title>
    <!-- uRj0Ak8VLEPhjWhg3m9z4EjXJwc -->
    <meta name="Keywords" content="友情链接查询" />
    <meta name="Description" content="友情链接查询" />
  </head>
  <body>
    <h1 id="Top-nbsp-News">Top News</h1>
    <p style="font-size: 200%">World News only on this page</p>
    Ah, and here&#39;s some more text, by the way.
    <p>... and this is a parsed fragment ...</p>
    <a href="http://www.cydf.org.cn/" rel="nofollow" target="_blank">青少年发展基金会</a> 
    <a href="http://www.4399.com/flash/32979.htm" target="_blank">洛克王国</a> 
    <a href="http://www.4399.com/flash/35538.htm" target="_blank">奥拉星</a> 
    <a href="http://game.3533.com/game/" target="_blank">手机游戏</a>
    <a href="http://game.3533.com/tupian/" target="_blank">手机壁纸</a>
    <a href="http://www.4399.com/" target="_blank">4399小游戏</a> 
    <a href="http://www.91wan.com/" target="_blank">91wan游戏</a>
  </body>
</html>
&#39;&#39;&#39;
page = etree.HTML(html.lower().decode(&#39;utf-8&#39;))
hrefs = page.xpath(u"//a")
for href in hrefs:
  print href.attrib

Print out The result is:

{'href': 'http://www.cydf.org.cn/', 'target': '_blank', 'rel': 'nofollow'}
{ 'href': 'http://www.4399.com/flash/32979.htm', 'target': '_blank'}
{'href': 'http://www.4399.com/flash /35538.htm', 'target': '_blank'}
{'href': 'http://game.3533.com/game/', 'target': '_blank'}
{' href': 'http://game.3533.com/tupian/', 'target': '_blank'}
{'href': 'http://www.4399.com/', 'target' : '_blank'}
{'href': 'http://www.91wan.com/', 'target': '_blank'}

If you want to get The content between a>,

for href in hrefs:

print href.text

The result is:

Youth Development Foundation
Locke Kingdom
Aola Star
Mobile Game
Mobile Wallpaper
4399 Mini Game
91wan Game

Things to note before using lxml: Make sure html first After utf-8 decoding, that is, code = html.decode('utf-8', 'ignore'), otherwise parsing errors will occur. Because Chinese is encoded into utf-8 and then becomes a form like '/u2541', lxml will consider the tag to end when it encounters "/".

XPATH basically uses a directory tree-like method to describe the path in the XML document. For example, use "/" as the separation between upper and lower levels. The first "/" represents the root node of the document (note, it does not refer to the outermost tag node of the document, but to the document itself). For example, for an HTML file, the outermost node should be "/html".

To locate a certain HTML tag, you can use an absolute path similar to the file path, such as page.xpath(u"/html/body/p"), which will find the body node All p tags; you can also use a relative path similar to the file path, you can use it like this: page.xpath(u"//p"), it will find all p tags in the entire html code:

  

World News only on this page


  Ah, and here's some more text, by the way.
  

.. . and this is a parsed fragment ...

Note: XPATH does not necessarily return the only node, but all nodes that meet the conditions. As shown above, as long as it is the p tag in the body, whether it is the first-level node, second-level, or third-level node of the body, it will be taken out.

If you want to further narrow the scope and directly locate "

World News only on this page

" what should you do? This requires adding filter conditions. The method of filtering is to use "[""]" to add filter conditions. There is a filtering syntax in lxml:

p = page.xpath(u"/html/body/p[@style='font-size: 200%']")

Or: p = page.xpath(u"//p[@style='font-size:200%']")

In this way, the p node with style font-size:200% in the body is taken out. Note: This p variable is a list of lxml.etree._Element objects, and the result of p[0].text is World News only on this page, that is, the value between tags; p The result of [0].values() is font-size: 200%, that is, all attribute values. Among them, @style represents the attribute style. Similarly, you can also use @name, @id, @value, @href, @src, @class....

If there is no such thing in the tag What to do with attributes? Then you can use text(), position() and other functions to filter. The function text() means to get the text contained in the node. For example:

hello

world

, use "p[text()='hello']" to get the p, and world is the text() of p . The function position() means to obtain the position of the node. For example, "li[position()=2]" means to obtain the second li node, which can also be omitted as "li[2]".

But what should be noted is the order of digital positioning and filtering conditions. For example, "ul/li[5][@name='hello']" means to take the fifth item li under ul, and its name must be hello, otherwise it will return empty. If you use "ul/li[@name='hello'][5]", the meaning is different. It means to find the fifth li node with the name "hello" under ul.

  此外,“*”可以代替所有的节点名,比如用"/html/body/*/span"可以取出body下第二级的所有span,而不管它上一级是p还是p或是其它什么东东。

而 “descendant::”前缀可以指代任意多层的中间节点,它也可以被省略成一个“/”。比如在整个HTML文档中查找id为“leftmenu”的 p,可以用“/descendant::p[@id='leftmenu']”,也可以简单地使用“ //p[@id='leftmenu']”。

text = page.xpath(u"/descendant::*[text()]")表示任意多层的中间节点下任意标签之间的内容,也即实现蜘蛛抓取页面内容功能。以下内容使用text属性是取不到的:

<p class="news">
    1. <b>无流量站点清理公告</b>  2013-02-22<br />
    取不到的内容
    </p>
    <p class="news">
    2. <strong>无流量站点清理公告</strong>  2013-02-22<br />
取不到的内容
</p> <p class="news"> 3. <span>无流量站点清理公告</span>  2013-02-22<br />
取不到的内容
</p> <p class="news"> 4. <u>无流量站点清理公告</u>  2013-02-22<br />
取不到的内容
</p>

这些“取不到的内容”使用这个是取不到的。怎么办呢?别担心,lxml还有一个属性叫做“tail”,它的意思是结束节点前面的内容,也就是说在“
”与“

”之间的内容。它的源码里面的意思是“text after end tag”

  至于“following-sibling::”前缀就如其名所说,表示同一层的下一个节点。"following-sibling::*"就是任意下一个节点,而“following-sibling::ul”就是下一个ul节点。

  如果script与style标签之间的内容影响解析页面,或者页面很不规则,可以使用lxml.html.clean模块。模块 lxml.html.clean 提供 一个Cleaner 类来清理 HTML 页。它支持删除嵌入或脚本内容、 特殊标记、 CSS 样式注释或者更多。

  cleaner = Cleaner(style=True, scripts=True,page_structure=False, safe_attrs_only=False)

  print cleaner.clean_html(html)

  注意,page_structure,safe_attrs_only为False时保证页面的完整性,否则,这个Cleaner会把你的html结构与标签里的属性都给清理了。使用Cleaner类要十分小心,小心擦枪走火。

 

  忽略大小写可以:

  page = etree.HTML(html)
  keyword_tag = page.xpath("//meta[translate(@name,'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz')='keywords']")


The above is the detailed content of How to parse HTML with lxml. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
The Future of HTML, CSS, and JavaScript: Web Development TrendsThe Future of HTML, CSS, and JavaScript: Web Development TrendsApr 19, 2025 am 12:02 AM

The future trends of HTML are semantics and web components, the future trends of CSS are CSS-in-JS and CSSHoudini, and the future trends of JavaScript are WebAssembly and Serverless. 1. HTML semantics improve accessibility and SEO effects, and Web components improve development efficiency, but attention should be paid to browser compatibility. 2. CSS-in-JS enhances style management flexibility but may increase file size. CSSHoudini allows direct operation of CSS rendering. 3.WebAssembly optimizes browser application performance but has a steep learning curve, and Serverless simplifies development but requires optimization of cold start problems.

HTML: The Structure, CSS: The Style, JavaScript: The BehaviorHTML: The Structure, CSS: The Style, JavaScript: The BehaviorApr 18, 2025 am 12:09 AM

The roles of HTML, CSS and JavaScript in web development are: 1. HTML defines the web page structure, 2. CSS controls the web page style, and 3. JavaScript adds dynamic behavior. Together, they build the framework, aesthetics and interactivity of modern websites.

The Future of HTML: Evolution and Trends in Web DesignThe Future of HTML: Evolution and Trends in Web DesignApr 17, 2025 am 12:12 AM

The future of HTML is full of infinite possibilities. 1) New features and standards will include more semantic tags and the popularity of WebComponents. 2) The web design trend will continue to develop towards responsive and accessible design. 3) Performance optimization will improve the user experience through responsive image loading and lazy loading technologies.

HTML vs. CSS vs. JavaScript: A Comparative OverviewHTML vs. CSS vs. JavaScript: A Comparative OverviewApr 16, 2025 am 12:04 AM

The roles of HTML, CSS and JavaScript in web development are: HTML is responsible for content structure, CSS is responsible for style, and JavaScript is responsible for dynamic behavior. 1. HTML defines the web page structure and content through tags to ensure semantics. 2. CSS controls the web page style through selectors and attributes to make it beautiful and easy to read. 3. JavaScript controls web page behavior through scripts to achieve dynamic and interactive functions.

HTML: Is It a Programming Language or Something Else?HTML: Is It a Programming Language or Something Else?Apr 15, 2025 am 12:13 AM

HTMLisnotaprogramminglanguage;itisamarkuplanguage.1)HTMLstructuresandformatswebcontentusingtags.2)ItworkswithCSSforstylingandJavaScriptforinteractivity,enhancingwebdevelopment.

HTML: Building the Structure of Web PagesHTML: Building the Structure of Web PagesApr 14, 2025 am 12:14 AM

HTML is the cornerstone of building web page structure. 1. HTML defines the content structure and semantics, and uses, etc. tags. 2. Provide semantic markers, such as, etc., to improve SEO effect. 3. To realize user interaction through tags, pay attention to form verification. 4. Use advanced elements such as, combined with JavaScript to achieve dynamic effects. 5. Common errors include unclosed labels and unquoted attribute values, and verification tools are required. 6. Optimization strategies include reducing HTTP requests, compressing HTML, using semantic tags, etc.

From Text to Websites: The Power of HTMLFrom Text to Websites: The Power of HTMLApr 13, 2025 am 12:07 AM

HTML is a language used to build web pages, defining web page structure and content through tags and attributes. 1) HTML organizes document structure through tags, such as,. 2) The browser parses HTML to build the DOM and renders the web page. 3) New features of HTML5, such as, enhance multimedia functions. 4) Common errors include unclosed labels and unquoted attribute values. 5) Optimization suggestions include using semantic tags and reducing file size.

Understanding HTML, CSS, and JavaScript: A Beginner's GuideUnderstanding HTML, CSS, and JavaScript: A Beginner's GuideApr 12, 2025 am 12:02 AM

WebdevelopmentreliesonHTML,CSS,andJavaScript:1)HTMLstructurescontent,2)CSSstylesit,and3)JavaScriptaddsinteractivity,formingthebasisofmodernwebexperiences.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Tools

ZendStudio 13.5.1 Mac

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools