
Introduction to python crawlers (4)--Detailed explanation of HTML text parsing library BeautifulSoup

零下一度 (Original)

2017-05-27

Beautiful Soup is a Python library whose main job is extracting data from web pages. This article introduces BeautifulSoup, the HTML parsing library used in this Python crawler series. The walkthrough is detailed and should serve as a useful reference for anyone who needs it.

Preface

The third article in this Python crawler series introduced Requests, the network request library. Once a request returns, the target data still has to be extracted. Different sites return content in many formats: JSON, which is the most developer-friendly; XML; and, most commonly of all, HTML documents. Today's topic is how to extract the data we care about from HTML.

Should you write your own HTML parser, or reach for regular expressions? Neither is a good solution. Fortunately, the Python community already has a very mature answer to this problem: BeautifulSoup is the nemesis of this class of task. It focuses on HTML document manipulation, and its name comes from a poem of the same name by Lewis Carroll.

BeautifulSoup is a Python library for parsing HTML documents. With BeautifulSoup, only a very small amount of code is needed to extract any content of interest from HTML. It also has a degree of fault tolerance, and can correctly handle an incompletely formatted HTML document.
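That fault tolerance is easy to see with a deliberately broken fragment (a minimal sketch; the input string is a made-up example):

```python
from bs4 import BeautifulSoup

# a deliberately malformed fragment: the <b> tag is never closed
broken = "<p>hello <b>world</p>"

soup = BeautifulSoup(broken, "html.parser")
# BeautifulSoup still builds a usable tree and recovers the text
print(soup.p.get_text())  # hello world
```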

Install BeautifulSoup

pip install beautifulsoup4

BeautifulSoup 3 is no longer officially maintained, so you should install the latest version, BeautifulSoup 4.

HTML tag

Before learning BeautifulSoup4, it is worth having a basic understanding of HTML documents. As the following code shows, HTML has a tree-shaped structure.

<html> 
 <head>
  <title>hello, world</title>
 </head>
 <body>
  <h1>BeautifulSoup</h1>
  <p>如何使用BeautifulSoup</p>
 </body>
</html>
  • It is made up of many tags (Tag): html, head, title, and so on are all tags.

  • A pair of tags forms a node; for example, <html></html> is the root node.

  • Nodes relate to one another: h1 and p are adjacent sibling nodes.

  • h1 is a direct child node of body and a descendant node of html.

  • body is the parent node of p, and html is an ancestor node of p.

  • A string nested between tags is a special child node of its tag; for example, "hello, world" is also a node, but it has no name.
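These relationships can be verified directly with BeautifulSoup (a minimal sketch, using Python 3 and the document above):

```python
from bs4 import BeautifulSoup

html = """
<html>
 <head><title>hello, world</title></head>
 <body>
  <h1>BeautifulSoup</h1>
  <p>how to use BeautifulSoup</p>
 </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.body.h1

# h1 and p are adjacent sibling nodes
print(h1.find_next_sibling().name)   # p
# body is the parent of h1; html is an ancestor
print(h1.parent.name)                # body
print([t.name for t in h1.parents])  # ['body', 'html', '[document]']
# the string inside a tag is itself a (nameless) child node
print(soup.title.string)             # hello, world
```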

Using BeautifulSoup

Constructing a BeautifulSoup object takes two parameters: the first is the HTML text string to be parsed, and the second tells BeautifulSoup which parser to use to parse the HTML.

The parser is responsible for parsing the HTML into the related objects, while BeautifulSoup is responsible for manipulating the data (create, read, update, delete). "html.parser" is Python's built-in parser; "lxml" is a parser implemented in C that runs faster, but it must be installed separately.
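The parser choice is just the second constructor argument (a minimal sketch; the lxml line is commented out because it needs `pip install lxml` first):

```python
from bs4 import BeautifulSoup

html = "<p>hello</p>"

# built-in parser: works out of the box, no extra dependency
soup = BeautifulSoup(html, "html.parser")
print(soup.p.string)  # hello

# lxml parser: faster, but requires `pip install lxml`
# soup = BeautifulSoup(html, "lxml")
```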

Through the BeautifulSoup object, any tag node in the HTML can be located.

from bs4 import BeautifulSoup 
text = """
<html> 
 <head>
  <title>hello, world</title>
 </head>
 <body>
  <h1>BeautifulSoup</h1>
  <p class="bold">如何使用BeautifulSoup</p>
  <p class="big" id="key1"> 第二个p标签</p>
  <a href="http://foofish.net" rel="external nofollow">python</a>
 </body>
</html> 
"""
soup = BeautifulSoup(text, "html.parser")

# the title tag
>>> soup.title
<title>hello, world</title>

# the p tag
>>> soup.p
<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>

# the content of the p tag
>>> soup.p.string
u'\u5982\u4f55\u4f7f\u7528BeautifulSoup'

BeautifulSoup abstracts HTML into four main data types: Tag, NavigableString, BeautifulSoup, and Comment. Each tag node is a Tag object; a NavigableString is generally a string wrapped inside a Tag object; and the BeautifulSoup object represents the entire HTML document. For example:

>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> type(soup.h1)
<class 'bs4.element.Tag'>
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

Tag

Each Tag has a name, which corresponds to the HTML tag name.


>>> soup.h1.name
u'h1'
>>> soup.p.name
u'p'

A Tag can also have attributes, which are accessed the same way as a dictionary. Because class is a multi-valued attribute, it is returned as a list:

>>> soup.p['class']
[u'bold']

NavigableString

The content inside a tag can be read directly via .string. It is a NavigableString object, which can be explicitly converted to a unicode string.

>>> soup.p.string
u'\u5982\u4f55\u4f7f\u7528BeautifulSoup'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> unicode_str = unicode(soup.p.string)
>>> unicode_str
u'\u5982\u4f55\u4f7f\u7528BeautifulSoup'
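Note that unicode() exists only in Python 2, which the interactive sessions above use. Under Python 3 every str is already Unicode, so the equivalent conversion is plain str() (a minimal Python 3 sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>如何使用BeautifulSoup</p>", "html.parser")

s = soup.p.string
print(type(s))      # <class 'bs4.element.NavigableString'>

# str() yields a plain built-in string, detached from the parse tree
plain = str(s)
print(type(plain))  # <class 'str'>
```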

With the basic concepts covered, we can now get to the main topic: how do we find the data we care about in the HTML? BeautifulSoup provides two approaches, traversal and search, and the two are usually combined to complete a lookup.

Traversing the document tree

Traversing the document tree means, as the name suggests, starting from the root html tag and walking down until the target element is found. One drawback is that if the content you want sits at the end of the document, the whole document must be traversed to reach it, which is slow. That is why traversal needs to be combined with the second approach.

When traversing the document tree, a tag node can be fetched directly via its dotted tag name, for example:

Get the body tag:

>>> soup.body
<body>\n<h1>BeautifulSoup</h1>\n<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>\n</body>

Get the p tag:

>>> soup.body.p
<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>

Get the content of the p tag:

>>> soup.body.p.string
\u5982\u4f55\u4f7f\u7528BeautifulSoup

As noted earlier, a tag's content is itself a node, so it can be retrieved here with .string. Another drawback of traversal is that only the first matching child node is reachable: if there are two adjacent p tags, the second cannot be reached via .p; instead, the next_sibling attribute fetches the adjacent node that follows. There are also many less commonly used attributes, such as .contents, which returns all direct children, and .parent, which returns the parent node; see the official documentation for more.
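These traversal attributes can be sketched with a small two-paragraph document (a hypothetical example; the id values are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<body><p id="a">first</p><p id="b">second</p></body>'
soup = BeautifulSoup(html, "html.parser")

first = soup.body.p             # dotted access only reaches the first match
print(first["id"])              # a

second = first.next_sibling     # the adjacent node that follows it
print(second["id"])             # b

print(second.parent.name)       # body
print(len(soup.body.contents))  # 2  (two direct children)
```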

Searching the document tree

Searching the document tree locates elements by tag name, and a node can also be pinpointed precisely by a tag's attribute values. The two most commonly used methods are find and find_all, and both can be called on a BeautifulSoup object or a Tag object.

find_all()

find_all( name , attrs , recursive , text , **kwargs )

find_all returns a list of Tags. The call is very flexible: all of its parameters are optional.

The first parameter, name, is the tag name to match.

# find all nodes whose tag name is title
>>> soup.find_all("title")
[<title>hello, world</title>]
>>> soup.find_all("p")
[<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>,
<p class="big" id="key1"> \u7b2c\u4e8c\u4e2ap\u6807\u7b7e</p>]

The second parameter is the tag's class attribute value:

# find all p tags whose class attribute is big
>>> soup.find_all("p", "big")
[<p class="big" id="key1"> \u7b2c\u4e8c\u4e2ap\u6807\u7b7e</p>]

which is equivalent to:

>>> soup.find_all("p", class_="big")
[<p class="big" id="key1"> \u7b2c\u4e8c\u4e2ap\u6807\u7b7e</p>]

Because class is a Python keyword, it is specified here as class_.

kwargs are attribute name/value pairs for the tag. For example, to find tags whose href attribute is "http://foofish.net":

>>> soup.find_all(href="http://foofish.net")
[<a href="http://foofish.net" rel="external nofollow">python</a>]

Regular expressions are supported as well:

>>> import re
>>> soup.find_all(href=re.compile("^http"))
[<a href="http://foofish.net" rel="external nofollow">python</a>]

Besides a concrete value or a regular expression, an attribute can also be a boolean (True/False), meaning the tag does or does not have that attribute.

>>> soup.find_all(id="key1")
[<p class="big" id="key1"> \u7b2c\u4e8c\u4e2ap\u6807\u7b7e</p>]
>>> soup.find_all(id=True)
[<p class="big" id="key1"> \u7b2c\u4e8c\u4e2ap\u6807\u7b7e</p>]

Traversal and search can be combined: first locate the body tag to narrow the search scope, then find the a tags within body.

>>> body_tag = soup.body
>>> body_tag.find_all("a")
[<a href="http://foofish.net" rel="external nofollow">python</a>]

find()

The find method is similar to find_all; the only difference is that it returns a single Tag object rather than a list, or None if no node matches. If multiple Tags match, only the first is returned.

>>> body_tag.find("a")
<a href="http://foofish.net" rel="external nofollow">python</a>
>>> body_tag.find("p")
<p class="bold">\u5982\u4f55\u4f7f\u7528BeautifulSoup</p>

get_text()

Besides .string, the get_text method also retrieves a tag's content. The difference is that the former returns a NavigableString object, while the latter returns a unicode string.

>>> p1 = body_tag.find('p').get_text()
>>> type(p1)
<type 'unicode'>
>>> p1
u'\u5982\u4f55\u4f7f\u7528BeautifulSoup'

>>> p2 = body_tag.find("p").string
>>> type(p2)
<class 'bs4.element.NavigableString'>
>>> p2
u'\u5982\u4f55\u4f7f\u7528BeautifulSoup'
>>>

In practice, we generally use the get_text method to retrieve a tag's content.

Summary

BeautifulSoup is a Python library for working with HTML documents. Initializing BeautifulSoup requires an HTML document string and a specific parser. BeautifulSoup has three commonly used data types: Tag, NavigableString, and BeautifulSoup. There are two ways to find HTML elements, traversing the document tree and searching it, and getting at data quickly usually combines the two.
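Putting the pieces together, a typical extraction flow looks like this (a sketch; in a real crawler the html string would come from a Requests response's .text, and the URLs are examples):

```python
from bs4 import BeautifulSoup

# in a real crawler this string would be requests.get(url).text
html = """
<html><body>
 <h1>Links</h1>
 <a href="http://foofish.net">python</a>
 <a href="http://example.com">example</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# combine traversal (narrow the scope to body) with search (find_all)
links = [(a.get_text(), a["href"]) for a in soup.body.find_all("a")]
print(links)  # [('python', 'http://foofish.net'), ('example', 'http://example.com')]
```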

[Related recommendations]

1. Introduction to python crawlers (5) -- a tutorial on regular expressions with examples

2. Introduction to python crawlers (3) -- building a Zhihu API with requests

3. Introduction to python crawlers (2) -- the HTTP library requests

4. A summary of Python's logical operator and

5. Introduction to python crawlers (1) -- a quick understanding of the HTTP protocol

