Crawler parsing method 2: BeautifulSoup
Crawlers can be written in many languages, but Python-based crawlers tend to be more concise and convenient, and crawling has become one of Python's staple uses. There are also many ways to parse the pages a crawler fetches.
You have probably mastered the Requests library already, but once Requests has fetched a page's HTML, how do we grab the information we want? You may have tried several approaches, such as the string find() method or the more advanced regular expressions. Regular expressions can match what we need, but tuning a pattern again and again just to match one string is frustrating.
So is there a more convenient tool? Yes: BeautifulSoup, a powerful library that makes it easy to extract the content of HTML or XML tags. This article covers its most common methods.
The previous article covered crawler parsing method 1, JSON parsing; this one covers BeautifulSoup parsing.
What is BeautifulSoup?
Web pages can be parsed in Python with regular expressions alone, but then we have to write a matching rule for every piece of markup we want, and the overall implementation becomes very complicated. BeautifulSoup is a convenient web page parsing library that processes documents efficiently and supports multiple parsers. In most cases it lets us extract information from a page without writing any regular expressions.
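To make the comparison concrete, here is a minimal sketch that extracts the same text both ways. The one-line HTML snippet and the variable names are illustrative assumptions, and html.parser is used only so that the sketch runs without extra dependencies.

```python
import re
from bs4 import BeautifulSoup

html = '<p class="title"><b>The Dormouse\'s story</b></p>'

# Regular expression: we have to describe the surrounding markup ourselves.
match = re.search(r"<b>(.*?)</b>", html)
regex_text = match.group(1) if match else None

# BeautifulSoup: navigate to the tag and read its text.
soup_text = BeautifulSoup(html, "html.parser").b.string

print(regex_text)
print(soup_text)
```

Both approaches find the same string here, but the regular expression has to be rewritten whenever the markup around the text changes.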
Official Document
Installation: $ pip install beautifulsoup4
BeautifulSoup supports several parsers, but two are used most often: html.parser from the Python standard library, and the lxml HTML parser. Both are used in much the same way:
from bs4 import BeautifulSoup

# html.parser from the Python standard library
BeautifulSoup(html, 'html.parser')

# lxml
BeautifulSoup(html, 'lxml')
The built-in html.parser runs at a moderate speed, and in older versions of Python it tolerates malformed documents poorly. The lxml HTML parser is fast, but it depends on C libraries being installed.
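Since lxml may not be installed everywhere, one option is to fall back to html.parser when it is missing. The helper below is a hypothetical sketch (make_soup is not part of BeautifulSoup), not a recommended pattern from the library itself.

```python
from bs4 import BeautifulSoup

def make_soup(html):
    """Parse with lxml when it is installed, otherwise fall back to html.parser."""
    try:
        import lxml  # noqa: F401  (imported only to check availability)
        parser = "lxml"
    except ImportError:
        parser = "html.parser"
    return BeautifulSoup(html, parser), parser

soup, parser_used = make_soup("<p>Hello, <b>soup</b></p>")
print(parser_used, soup.b.string)
```

Keep in mind that the two parsers can build slightly different trees from incomplete HTML, so a fallback like this is best suited to well-formed pages.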
Installation of lxml
Because lxml depends on C libraries, installing it on Windows can fail with all sorts of strange errors. If you are lucky, pip install lxml simply works, but many people get stuck at this step.
In that case, installing lxml from a .whl file is recommended. First install the wheel library so that .whl files can be installed properly:
pip install wheel
Download the lxml .whl file that matches your system and Python version from the official website.
If you do not know which tags your system and Python version support, open a command prompt (CMD) and start Python, or open Python's IDLE, and enter the following code:
import pip
# pip.pep425tags is only available in old pip releases (before pip 10);
# with a recent pip, run `pip debug --verbose` on the command line instead.
print(pip.pep425tags.get_supported())
This prints the tags supported by the installed Python, which tell us which .whl file to choose.
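If the pip call above is not available, the standard library alone can report the basics needed to pick a wheel. This is a small sketch using only sys, struct, and platform; the variable names are illustrative.

```python
import platform
import struct
import sys

python_version = "{}.{}.{}".format(*sys.version_info[:3])
bits = struct.calcsize("P") * 8  # pointer size in bits: 32 or 64

print("Python", python_version)
print("Interpreter:", bits, "bit")
print("System:", platform.system(), platform.machine())
```

The Python version and the 32/64-bit value correspond to the version and platform parts of a wheel's file name.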
After downloading the lxml file, locate it, open the command prompt, and install it with pip: pip install <full name of the .whl file>
When the installation finishes, start Python and import lxml. If no error is reported, the installation succeeded.
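The import check can also be scripted. The sketch below uses importlib from the standard library to report whether each package can be found; is_installed is a hypothetical helper name.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None

for name in ("bs4", "lxml"):
    print(name, "installed" if is_installed(name) else "missing")
```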
If that seems like too much trouble, consider installing Anaconda (if the download is slow, look for a domestic mirror). If you have not heard of it, a quick search will explain what it is. With Anaconda, the pip installation errors seen on Windows largely go away.
BeautifulSoup’s basic tag selection method
Python's built-in html.parser is not bad, but I still recommend lxml because it is fast. The code below therefore uses the lxml parser for its demonstrations.
Let's start with the example from the official documentation:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
From this HTML code we can create a BeautifulSoup object, which can then be output in a standard indented structure:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
Notice that the HTML above is incomplete: the closing </body> and </html> tags are missing. The parser fills them in automatically, and the prettify() method prints the completed document. The comments show the output:
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
Get tag
print(soup.title) # <title>The Dormouse's story</title>
The output shows that what we got is simply the title tag from the HTML code.
Get the name
print(soup.title.name)
# title
This is simply the name of the tag.
Get attributes
print(soup.p.attrs['class'])
# ['title']
print(soup.p['class'])
# ['title']
To get an attribute of a tag, we can look the attribute name up in the tag's attrs dictionary. As the output shows, indexing the p tag directly with the attribute name returns the same value. Because class can hold several values, BeautifulSoup returns it as a list.
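The following sketch shows the attribute access styles side by side. The one-line HTML snippet is made up for illustration, and html.parser is used so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" id="intro"><b>Hi</b></p>', "html.parser")
p = soup.p

all_attrs = p.attrs      # dictionary of every attribute on the tag
classes = p["class"]     # multi-valued attribute: a list of strings
identifier = p["id"]     # single-valued attribute: a plain string
style = p.get("style")   # missing attribute: None instead of a KeyError

print(all_attrs)
print(classes, identifier, style)
```

Using get() is the safer choice when an attribute may be absent, because indexing with square brackets raises KeyError in that case.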
Get content
print(soup.title.string)
# The Dormouse's story
We can also chain selections. For example, to get the content of the p tag inside the body tag:
print(soup.body.p.string)
# The Dormouse's story
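One caveat worth seeing in code: .string only returns text when the tag has a single child, while get_text() joins all the text below a tag. The snippet below is a small sketch with an invented HTML fragment, parsed with html.parser.

```python
from bs4 import BeautifulSoup

html = '<p class="story">Once upon a time: <a href="#">Elsie</a> and <a href="#">Lacie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

single = soup.a.string         # the <a> tag has exactly one text child
mixed = soup.p.string          # the <p> tag has several children, so this is None
full_text = soup.p.get_text()  # concatenates all text found below the tag

print(single)
print(mixed)
print(full_text)
```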
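Common usage
Standard selectors
The basic techniques above (selecting tags and reading their content) are enough to parse simple HTML. On more complex pages, however, they fall short or become tedious, because a tag may carry several attributes (class, id, and so on).
Fortunately, BeautifulSoup provides convenient standard selectors in the form of API methods. This article focuses on two of them: find() and find_all(). The other methods take similar parameters and work in similar ways, so what you learn here carries over.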
find_all()
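find_all(name, attrs, recursive, string, limit, **kwargs) searches the document by tag name, attributes, or text content. (Older versions of BeautifulSoup call the string parameter text.)
find_all() works much like a regular expression search: it finds every result that satisfies the matching pattern and returns them as a list.
We use the example from the documentation again: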
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
Filters
Documentation reference
Before looking at find_all(), it helps to know the kinds of filters. A filter is only ever used as an argument to a search method, so "argument type" might be the more fitting name. Filters run through the entire search API: they can be applied to a tag's name, to its attributes, to strings, or to a combination of these.
find_all() looks through all of the tags that descend from the current tag and checks whether each one satisfies the filter. Here are a few examples:
soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
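Some of these look familiar and some are new. What do the string and id arguments mean? Why does find_all("p", "title") return the p tag whose CSS class is "title"? Let's take a closer look at the parameters of find_all():
The name parameter
The name parameter finds all tags whose name matches name; string objects (text nodes) are ignored automatically.
soup.find_all("title")
# [<title>The Dormouse's story</title>]
The value of name can be any kind of filter: a string, a regular expression, a list, a function, or True.
Most often, the name we pass is simply the tag name to search for.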
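Each of the filter types listed above can be tried on a tiny document. The sketch below uses an invented HTML snippet and a hypothetical names() helper to keep the results readable; html.parser is used so that it runs without lxml.

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><p><b>x</b></p><a href='#'>y</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

def names(tags):
    """Return just the tag names so the results are easy to compare."""
    return [tag.name for tag in tags]

by_string = names(soup.find_all("b"))                # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))    # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))           # any name in the list
by_function = names(soup.find_all(lambda tag: tag.has_attr("href")))  # custom test
by_true = names(soup.find_all(True))                 # every tag, no text nodes

print(by_string, by_regex, by_list, by_function, by_true, sep="\n")
```

Results always come back in document order, regardless of the order of the names in a list filter.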
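Keyword arguments
Suppose our HTML contains several div tags, but we only want the div tags whose class attribute is top. How do we pick them out?
soup.find_all('div', class_='top')
# Note: class is a reserved word in Python, so the CSS class attribute is written with a trailing underscore as class_; otherwise Python raises a syntax error.
Using the example code above again:
soup.find_all('a', id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
This returns only the a tag whose id is link2.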
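The sketch below runs the div example on a small invented document. It also shows the attrs dictionary, which is needed for attribute names such as data-role that are not valid Python keyword names.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="top">A</div>'
    '<div class="bottom">B</div>'
    '<div class="top wide" data-role="nav">C</div>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if any one of its classes equals "top".
top_divs = soup.find_all("div", class_="top")

# Attribute names containing a hyphen go into the attrs dictionary.
nav_divs = soup.find_all("div", attrs={"data-role": "nav"})

top_text = [div.get_text() for div in top_divs]
nav_text = [div.get_text() for div in nav_divs]
print(top_text, nav_text)
```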
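The limit parameter
find_all() returns every match, which can be slow when the document tree is large. If we do not need all the results, the limit parameter caps how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and the results are returned.
For example, three a tags match our search, but we only want two:
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
The remaining parameters are used less often; see the official documentation if you need them.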
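For readers curious about the less common parameters, here is a brief sketch of recursive and string on an invented document. It assumes a BeautifulSoup release recent enough to accept the string keyword.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><a>Elsie</a><a>Lacie</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

# By default the search covers all descendants of <html>.
deep = soup.html.find_all("title")

# recursive=False only looks at direct children; <title> sits inside <head>.
direct_only = soup.html.find_all("title", recursive=False)

# string matches text nodes rather than tags.
by_text = soup.find_all(string="Elsie")

print(len(deep), len(direct_only), [str(s) for s in by_text])
```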
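find()
find_all() returns a list of all matching elements; find() returns a single element.
find(name, attrs, recursive, string, **kwargs)
find_all() returns every tag in the document that matches, even when we only want one result. If the document contains only one tag of a given kind, searching for it with find_all() is a poor fit; rather than calling find_all() with limit=1, it is simpler to call find() directly. The following two lines are nearly equivalent:
soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>
The only difference is that find_all() returns a list containing a single element, while find() returns the result itself. When nothing matches, find_all() returns an empty list and find() returns None.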
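Because find() can return None, code that chains onto its result should check for that first. The helper below, text_of, is a hypothetical name used only for this sketch.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

def text_of(soup, tag_name, default=""):
    """Return the text of the first matching tag, or a default when none exists."""
    tag = soup.find(tag_name)
    return tag.get_text() if tag is not None else default

print(repr(text_of(soup, "title")))
print(repr(text_of(soup, "table")))
print(soup.find_all("table"))
```

Without the check, calling get_text() on a missing tag raises AttributeError, because None has no such method.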
CSS selectors
Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax. As in CSS, a class name is prefixed with "." and an id with "#".
soup.select("title")
# [<title>The Dormouse's story</title>]
Searching level by level through nested tags:
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]
Finding the direct children of a tag:
soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []
Searching by CSS class name:
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Searching by a tag's id:
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Querying with several CSS selectors at once, separated by commas:
soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
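Two related conveniences are sketched below: select_one(), which returns only the first match (or None), and a CSS attribute selector. The HTML is a shortened copy of the documentation example, parsed with html.parser.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("p.story > a.sister")  # first match only
ending = soup.select('a[href$="tillie"]')      # href ends with "tillie"
missing = soup.select_one("table")             # no match gives None

print(first["id"])
print([a.get_text() for a in ending])
print(missing)
```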
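Extracting tag content
Suppose we have obtained several tags:
from bs4 import BeautifulSoup

html = '<a href="http://www.baidu.com/">百度</a> <a href="http://www.163.com/">网易</a> <a href="http://www.sina.com/">新浪</a>'
links = BeautifulSoup(html, 'lxml').find_all('a')
# [<a href="http://www.baidu.com/">百度</a>, <a href="http://www.163.com/">网易</a>, <a href="http://www.sina.com/">新浪</a>]
How do we extract what is inside them? We touched on this at the start.
for link in links:
    print(link.get_text())   # get_text() returns the text inside the tag
    print(link.get('href'))  # get('attribute name') returns an attribute value
    print(link['href'])      # the subscript shorthand gives the same result
Result:
百度
http://www.baidu.com/
http://www.baidu.com/
网易
http://www.163.com/
http://www.163.com/
新浪
http://www.sina.com/
http://www.sina.com/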
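In practice the extracted values are usually collected rather than printed. The sketch below gathers (text, href) pairs into a list; the third link deliberately has no href (an assumption made for illustration) to show that get() returns None in that case.

```python
from bs4 import BeautifulSoup

html = (
    '<a href="http://www.baidu.com/">百度</a>'
    '<a href="http://www.163.com/">网易</a>'
    '<a name="top">新浪</a>'  # no href on purpose
)
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims surrounding whitespace; get() tolerates missing attributes.
rows = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

for text, href in rows:
    print(text, href)
```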
Summary
For the parser, lxml is recommended; if garbled characters appear, try html.parser instead. Selecting and filtering by tag name alone (soup.title, soup.p) is weak but fast. For searching, the find_all() and find() methods are recommended; if you are familiar with CSS selectors, use .select(). Use get_text() to obtain a tag's text content, and get('attribute name') or tag['attribute name'] to obtain an attribute value.