Web Scraping with Python: HTML Parsing in Depth

Jun 21, 2016, 08:55 AM

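Someone once asked Michelangelo: "How did you create a masterpiece like David?" He replied: "It's simple. I went to the quarry, saw a huge block of marble, and all I had to do was chisel away the marble that did not belong. Then David was born."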

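Scraping a web page works the same way: we discard what we do not need and extract the information we do, although the technique involved is fairly intricate. This article introduces HTML parsing techniques.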

In the previous article (Web Scraping with Python: A First Web Scraping Example) we got a first look at the BeautifulSoup library. Here we will find tags by their attributes.

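Almost every website uses CSS, which works in our favor when scraping: CSS requires HTML elements that should look different to carry different markup. For example, one span may be tagged class="green" while another is tagged class="red".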

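Take the page http://www.pythonscraping.com/pages/warandpeace.html. It contains a passage of text in which spoken dialogue is shown in red and the names of the speakers in green. Here is a fragment of its source:

"<span class="red">Heavens! what a virulent attack!</span>" replied <span class="green">the prince</span>, not in the least disconcerted by this reception.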

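We can reuse the program from the previous article to create a BeautifulSoup object holding the whole page:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it with Python's built-in HTML parser
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")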

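Use the BeautifulSoup object's findAll method to extract a list of the tags that meet the given criteria:

# Collect every <span class="green"> and print its text
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())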

Putting the code above together:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")

# Speaker names are wrapped in <span class="green">
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())

Output:

Anna

Pavlovna Scherer

Empress Marya

……

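To explain the code above:

bsObj.findAll(tagName, tagAttributes) returns a list of every tag on the page that matches. Iterating over that list then gives access to the content of each tag.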
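The same lookup can be tried without a network connection. The following is a minimal sketch, assuming a small hand-written HTML string (SAMPLE_HTML and the helper green_names are illustrative names, not part of the original article). It uses find_all, the current bs4 spelling of findAll; the older name used in this article remains an alias.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for the downloaded page.
SAMPLE_HTML = """
<div id="text">
"<span class="red">Heavens! what a virulent attack!</span>" replied
<span class="green">the prince</span>, not in the least disconcerted.
<span class="green">Anna Pavlovna</span> smiled.
</div>
"""


def green_names(markup):
    """Return the text of every <span class="green"> found in markup."""
    soup = BeautifulSoup(markup, "html.parser")
    return [tag.get_text() for tag in soup.find_all("span", {"class": "green"})]


print(green_names(SAMPLE_HTML))
```

Returning a list of strings instead of printing inside the loop keeps the extraction step separate from whatever is done with the names afterwards.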

find() and findAll()

The two methods are very similar. Their signatures are:

findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

The tag argument works as seen before: you can pass a single tag name as a string, or a collection of tag names: .findAll({"h1","h2","h3","h4","h5","h6"})
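As a small illustration of passing several tag names at once, here is a sketch on a hypothetical inline snippet (SAMPLE_HTML is made up for this example). Matches come back in document order, whatever the order of the names passed in.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with headings mixed in among other tags.
SAMPLE_HTML = "<h1>Title</h1><p>Intro</p><h2>Part one</h2><h3>Detail</h3>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Any tag whose name appears in the list is a match; <p> is skipped.
headings = soup.find_all(["h3", "h1", "h2"])
heading_names = [tag.name for tag in headings]
print(heading_names)
```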

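The attributes argument takes a dictionary of attributes and matches tags that carry any of the listed values. A Python dictionary cannot hold the same key twice, so to match either of two classes, pass both values under a single "class" key: .findAll("span", {"class": {"green", "red"}})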
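The sketch below shows both points on a hypothetical three-span snippet: a collection of values under one key matches any of them, while a dictionary literal that repeats the "class" key silently keeps only the last value. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with three differently classed spans.
SAMPLE_HTML = (
    '<span class="red">Heavens!</span> '
    '<span class="green">the prince</span> '
    '<span class="blue">narration</span>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One key, several acceptable values: matches red OR green spans.
either = soup.find_all("span", {"class": ["green", "red"]})
either_text = [tag.get_text() for tag in either]
print(either_text)

# Pitfall: repeating a key in a dict literal keeps only the last value,
# so this filter would match red spans only.
dup = {"class": "green", "class": "red"}
print(dup)
```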

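The recursive argument is a boolean that controls how deep the search goes. With recursive=True (the default), findAll looks through children, their children, and so on; with recursive=False it looks only at the direct children of the tag it is called on.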
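A minimal sketch of the difference, assuming a hypothetical snippet with one paragraph directly inside a div and another nested one level deeper:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <p> is a direct child, one is nested deeper.
SAMPLE_HTML = (
    '<div id="outer"><p>direct</p>'
    "<section><p>nested</p></section></div>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outer = soup.find("div", {"id": "outer"})

# Default (recursive=True): search the whole subtree under the div.
all_p = [p.get_text() for p in outer.find_all("p")]

# recursive=False: only tags that are direct children of the div.
direct_p = [p.get_text() for p in outer.find_all("p", recursive=False)]

print(all_p)
print(direct_p)
```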

The keywords argument lets you select tags by one particular attribute, for example:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")

# Select by attribute using a keyword argument
allText = bsObj.findAll(id="text")
# Equivalent: allText = bsObj.findAll("", {"id": "text"})
print(allText[0].get_text())
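The keyword form and the dictionary form can be compared side by side on a hypothetical inline snippet. Because class is a reserved word in Python, bs4 accepts class_ as the keyword for the class attribute. This is a sketch; the sample markup is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two divs distinguished by id.
SAMPLE_HTML = (
    '<div id="text"><span class="green">Anna</span> spoke.</div>'
    '<div id="footer">end</div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword argument and attribute dictionary select the same tag.
by_keyword = soup.find_all(id="text")
by_dict = soup.find_all(attrs={"id": "text"})
same_result = [t.get_text() for t in by_keyword] == [t.get_text() for t in by_dict]

# "class" is reserved in Python, so the keyword is spelled class_.
green = [tag.get_text() for tag in soup.find_all(class_="green")]

print(by_keyword[0].get_text())
print(same_result, green)
```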

To look at a tag's children, use children:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# Iterate over the direct children of the gift table
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)
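children yields only the direct children of a tag, whereas descendants walks the whole subtree. A minimal sketch on a hypothetical two-row table (written without whitespace between tags so that no text nodes appear among the children):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table modelled loosely on the gift list page.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Direct children only: the two rows.
child_names = [child.name for child in table.children]

# Every tag anywhere below the table (text nodes filtered out).
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```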

To skip the first row of the table, use next_siblings:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# Start at the first <tr> and walk over everything that follows it
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
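Because a tag is never its own sibling, starting from the first row leaves that row out. The sketch below assumes a hypothetical table with one header row and two data rows; it also filters out the whitespace text nodes that sit between rows when the markup is indented.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table: one header row followed by two data rows.
SAMPLE_HTML = """
<table id="giftList">
  <tr><th>Item</th><th>Cost</th></tr>
  <tr><td>Vegetable Basket</td><td>$15.00</td></tr>
  <tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header_row = soup.find("table", {"id": "giftList"}).tr

# next_siblings yields whitespace strings as well as tags; keep tags only.
data_rows = [row for row in header_row.next_siblings if isinstance(row, Tag)]
first_cells = [row.td.get_text() for row in data_rows]

print(first_cells)
```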

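To reach a tag's parent, use parent. It can be combined with previous_sibling to move from one table cell to the cell before it:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# From the image, go up to its cell, then back one cell to the price
print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())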

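The HTML structure below makes the path clear:

<tr>
    <td>
    <td>
    <td> (3)
        "$15.00" (4)
    <td> (2)
        <img src="../img/gifts/img1.jpg"> (1)

(1) The img tag with the given src is selected. (2) parent moves up to the td that contains it. (3) previous_sibling moves to the td just before it, which holds the price. (4) get_text() returns the text in that cell, "$15.00".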
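The same walk can be reproduced on a hypothetical single-row table. This is a sketch: the sample markup has no whitespace between cells, so previous_sibling lands directly on the price cell. When the markup is indented, previous_sibling may return a whitespace string instead, and find_previous_sibling("td") is a more robust way to reach the preceding cell.

```python
from bs4 import BeautifulSoup

# Hypothetical row modelled on the gift list: title, description, cost, image.
SAMPLE_HTML = (
    '<table id="giftList"><tr>'
    "<td>Vegetable Basket</td><td>Fresh produce</td><td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})

# Up to the enclosing <td>, then one node back to the price cell.
price = img.parent.previous_sibling.get_text()

# Tag-aware variant that skips over any text nodes between cells.
price_robust = img.parent.find_previous_sibling("td").get_text()

print(price, price_robust)
```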

Regular Expressions and BeautifulSoup

For regular expressions in Python, see my other article, "Python Basics (9): Regular Expressions".

Notice that the example page above contains structures like this:

<img src="../img/gifts/img1.jpg">

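Suppose the requirement is to extract all of these img tags. Following the earlier approach, findAll("img") looks like the answer, but modern sites may also contain hidden img tags and other unpredictable elements. A regular expression that matches on the image path handles this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# Keep only images whose src looks like ../img/gifts/img<anything>.jpg
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])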

Output:

../img/gifts/img1.jpg

../img/gifts/img2.jpg

../img/gifts/img3.jpg

../img/gifts/img4.jpg

../img/gifts/img6.jpg
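To see why the pattern is more selective than findAll("img"), here is a minimal sketch on a hypothetical snippet that mixes product images with a logo and a spacer (the logo and spacer file names are made up for this example). bs4 applies a compiled pattern to the attribute value with the pattern's search() method.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: two product images plus unrelated images.
SAMPLE_HTML = """
<img src="../img/gifts/logo.jpg">
<img src="../img/gifts/img1.jpg">
<img src="../img/spacer.gif">
<img src="../img/gifts/img2.jpg">
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Raw string so the backslashes reach the regex engine unchanged.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

all_srcs = [img["src"] for img in soup.find_all("img")]
product_srcs = [img["src"] for img in soup.find_all("img", {"src": pattern})]

print(len(all_srcs))
print(product_srcs)
```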

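Author: 工学1号馆

Source: http://wuyudong.com/1842.html

Copyright of this article belongs to the author. Reposting is welcome as long as a link to the original is displayed prominently on the page; otherwise the author reserves the right to pursue legal action.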
