Crawler parsing method 2: BeautifulSoup

爱喝马黛茶的安东尼

Jun 05, 2019 pm 01:25 PM

Crawlers can be written in many languages, but Python-based crawlers are more concise and convenient, and crawling has become an essential part of the Python ecosystem. There are also many ways to parse what a crawler fetches.

You have probably mastered the Requests library by now. But once Requests has fetched a page's HTML, how do we pick out the information we want? You have likely tried several methods, such as the string find method or the more advanced regular expressions. Regular expressions can match the information we need, but tuning a matching rule again and again to capture one particular string is frustrating.

So is there a more convenient tool? Yes: a powerful tool called BeautifulSoup. With it, we can easily extract the content of HTML or XML tags. In this article, we go through the common methods of BeautifulSoup.

The previous article covered crawler parsing method 1: JSON parsing. This article covers BeautifulSoup parsing.

What is BeautifulSoup?

Web pages can be parsed in Python with regular expressions, but then we have to match the markup piece by piece and write the matching rules ourselves, which makes the overall implementation quite complicated. BeautifulSoup, by contrast, is a convenient and efficient web page parsing library that supports multiple parsers. In most cases, it lets us extract information from a page without writing regular expressions.
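To make the contrast concrete, here is a minimal sketch that extracts the same title once with a regular expression and once with BeautifulSoup. The sample HTML string and the variable names are made up for illustration, and html.parser is used only so that the snippet runs without extra dependencies.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page, not taken from the article.
html = "<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"

# Regular expression: we have to write and maintain the matching rule ourselves.
match = re.search(r"<title>(.*?)</title>", html, re.S)
title_from_regex = match.group(1) if match else None

# BeautifulSoup: the parser builds a tree and we simply ask for the tag.
soup = BeautifulSoup(html, "html.parser")
title_from_soup = soup.title.string

print(title_from_regex)
print(title_from_soup)
```

Both approaches give the same text here, but the regular expression breaks as soon as the tag gains attributes or unusual spacing, while the tree lookup keeps working.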

Official Document

Installation: $ pip install beautifulsoup4

BeautifulSoup is a web page parsing library that supports many parsers. The two most commonly used are the HTML parser in the Python standard library and the lxml HTML parser. They are used in much the same way:

from bs4 import BeautifulSoup

# Python standard library
BeautifulSoup(html, 'html.parser')

# lxml
BeautifulSoup(html, 'lxml')

The parser built into Python's standard library runs at a moderate speed, but in older versions of Python its tolerance of malformed documents is relatively poor. The lxml HTML parser is fast, but it requires installing a library that depends on C.
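One way to act on this trade-off is to prefer lxml and fall back to the standard library parser when lxml is missing. The sketch below assumes that approach; the helper name make_soup is hypothetical, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml when it is available, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so fall back to the standard library parser.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p class='title'>Hello</p>")
print(soup.p.get_text())
```

The rest of the code can then use the returned object without caring which parser built it, although the two parsers may repair badly broken markup differently.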

Installation of lxml

Because lxml depends on C libraries, installing it on Windows can produce all sorts of strange errors. If you are lucky, pip install lxml will simply succeed, but many people get stuck at this step.

In that case it is recommended to install lxml from a .whl file. First install the wheel library, which is required for installing .whl files: pip install wheel

Then download the lxml file that matches your system and Python version from the official website.

If you do not know your system and Python version details, open the administrator command prompt (CMD) or Python's IDLE and enter the following code:

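import platform
import sys

# pip.pep425tags can only be imported in old pip releases (before pip 10);
# the standard library reports the same version and architecture details.
print(sys.version, platform.machine(), platform.architecture()[0])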

This prints the Python version information.
After downloading the lxml file, locate it, then open the administrator command prompt and install it with pip: pip install followed by the full name of the .whl file

Once the installation is complete, start Python and import lxml. If no error is reported, the installation succeeded.
If this seems like too much trouble, I recommend installing Anaconda (if the download is slow, look for a domestic mirror). If you do not know what Anaconda is, a quick search will explain it. With it, pip installation errors on Windows are no longer a problem.
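The import check described above can also be scripted. This is a small sketch rather than an official procedure: the helper name lxml_is_installed is hypothetical, and it only reports whether the current interpreter can find the package.

```python
import importlib.util


def lxml_is_installed():
    """Return True if the lxml package can be found by the current interpreter."""
    return importlib.util.find_spec("lxml") is not None


if lxml_is_installed():
    from lxml import etree

    # LXML_VERSION is a tuple of version numbers.
    print("lxml is available, version:", etree.LXML_VERSION)
else:
    print("lxml is not installed; BeautifulSoup can still use 'html.parser'")
```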

BeautifulSoup’s basic tag selection method

Although the parser in Python's standard library is not bad, I still recommend lxml because it is fast. The code below therefore uses the lxml parser.
Let's start with the example from the official documentation:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""

Parsing this HTML code gives us a BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

The HTML above is incomplete: the body and html tags are never closed. The parser fills in the missing tags, and the prettify() method prints the completed tree with standard indentation. The comments show the output:

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
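The automatic completion can be seen on a much smaller input. The fragment below is a made-up example with two tags that are never closed; html.parser is used so that the snippet does not depend on lxml (lxml would additionally wrap the fragment in html and body tags).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <p> nor <b> is closed.
broken = "<p>Hello <b>world"

soup = BeautifulSoup(broken, "html.parser")

# The tree builder closes the open tags when it reaches the end of the input.
repaired = str(soup)
print(repaired)

# prettify() prints the same repaired tree with one tag or string per line.
print(soup.prettify())
```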

Get tag

print(soup.title)
# <title>The Dormouse's story</title>

The output shows that the result is the title tag from the HTML code.

Get the name

print(soup.title.name)
# title

The result is simply the name of the tag.

Get attributes

print(soup.p.attrs['class'])
# ['title']

print(soup.p['class'])
# ['title']

To get an attribute of a tag, we can index its attrs dictionary with the attribute name. As the results show, indexing the p tag directly with the attribute name returns the same value. Because class can hold several values, it is returned as a list.
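A short sketch of the attribute access described above, using a made-up tag rather than the article's document. It also shows the get() method, which is a safe way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class attribute and a single-valued id.
soup = BeautifulSoup('<p class="title main" id="intro">Hello</p>', "html.parser")
p = soup.p

print(p.attrs)         # all attributes as a dictionary
print(p["class"])      # class is multi-valued, so the value is a list
print(p["id"])         # single-valued attributes come back as strings
print(p.get("style"))  # get() returns None instead of raising KeyError
```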

Get content

print(soup.title.string)
# The Dormouse's story

We can also use nested selection, for example to get the content of the p tag inside the body tag:

print(soup.body.p.string)
# The Dormouse's story
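One detail worth knowing about .string: it only returns text when the tag has a single child. The sketch below uses a made-up paragraph to show the difference between .string and get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two children: a <b> tag and a trailing string.
soup = BeautifulSoup("<p><b>Bold</b> and plain</p>", "html.parser")

bold_text = soup.p.b.string    # the <b> tag has one child, so this is its text
ambiguous = soup.p.string      # the <p> tag has two children, so this is None
full_text = soup.p.get_text()  # get_text() joins all the text inside the tag

print(bold_text)
print(ambiguous)
print(full_text)
```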

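Common usage

Standard selectors

The basic usage of BeautifulSoup shown above (getting tags and getting their content) is enough to parse some HTML code. For many complex pages, however, these methods are insufficient or tedious, because tags often carry several attributes (class, id and so on).

Fortunately, BeautifulSoup provides convenient standard selectors, that is, API methods. This article focuses on two of them: find() and find_all(). The other methods take similar parameters and are used in a similar way, so you can work them out by analogy.

find_all()

find_all(name, attrs, recursive, string, limit, **kwargs) searches the document by tag, attribute or content. (The string parameter was called text in versions before 4.4.0.)
find_all() works much like a regular expression search: it finds every result that matches the pattern and returns them as a list.
We again use the example from the documentation: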

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
 
soup = BeautifulSoup(html_doc, 'lxml')

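Filters

Documentation reference
Before introducing the find_all() method, it helps to look at the types of filters. Filters are only used as arguments when searching the document, so "argument types" might be a more fitting name. They are used throughout the search API, and can match on a tag's name, on its attributes, on the text of a string, or on some combination of these.

The find_all() method looks through the descendants of the current tag and checks whether each one matches the filters. Here are a few examples: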

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
 
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

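Some of these should look familiar, while others are new. What does it mean to pass a value for string or id? Why does find_all("p", "title") return the p tag whose CSS class is "title"? Let's take a closer look at the parameters of find_all():

name parameter

The name parameter finds all tags whose name is name. String objects (text nodes) are ignored automatically.

soup.find_all("title")
# [<title>The Dormouse's story</title>]

The value of the name parameter can be any type of filter: a string, a regular expression, a list, a function, or True.
Most often, the name parameter is simply the tag name to search for.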
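The five filter types can be tried side by side. This is a minimal sketch on a made-up document; the helper names (names, has_id) are hypothetical and exist only to keep the output compact.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document, small enough to list every tag by hand.
html = '<html><head><title>T</title></head><body><b>x</b><p id="a">y</p></body></html>'
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Return just the tag names so the results are easy to compare."""
    return [tag.name for tag in tags]


def has_id(tag):
    """Function filter: accept only tags that carry an id attribute."""
    return tag.has_attr("id")


by_string = names(soup.find_all("b"))                # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))    # names starting with "b"
by_list = names(soup.find_all(["title", "p"]))       # any name in the list
by_true = names(soup.find_all(True))                 # every tag, no strings
by_function = names(soup.find_all(has_id))           # custom predicate

print(by_string, by_regex, by_list, by_true, by_function, sep="\n")
```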

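keyword parameters

Suppose our HTML code contains several div tags, but we only want the div tags whose class attribute is top. How do we pick them out?

soup.find_all('div', class_='top')

# Note that class is a reserved word in Python, so the keyword argument for the CSS class attribute is written with a trailing underscore (class_); a bare class would raise a SyntaxError.

Still using the example code above:

soup.find_all('a', id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

This returns only the a tag whose id is link2.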
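The keyword filters can be sketched on a self-contained document. The HTML below is invented for illustration (including the data-role attribute); it shows class_, a regular expression as a keyword value, and the attrs dictionary, which is needed for attribute names that are not valid Python keyword arguments.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; data-role is an invented attribute for this sketch.
html = """
<div class="top" data-role="nav">menu</div>
<div class="bottom">footer</div>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ filters on the CSS class attribute.
top_divs = soup.find_all("div", class_="top")

# A keyword value can also be a regular expression.
elsie_links = soup.find_all(href=re.compile("elsie"))

# Names such as data-role cannot be written as keyword arguments, so use attrs.
nav_divs = soup.find_all(attrs={"data-role": "nav"})

print(top_divs, elsie_links, nav_divs, sep="\n")
```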

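limit parameter

The find_all() method returns all matching results, which can be slow when the document tree is large. If we do not need every result, we can use the limit parameter to cap the number returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and the results are returned.

For example, three a tags match our search, but we only want two of them:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The other parameters are used less often; see the official documentation if you need them.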
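For readers who want a quick look at the remaining parameters, here is a brief sketch of recursive and string next to limit. The document is a shortened, made-up variant of the example, and the string keyword assumes Beautiful Soup 4.4.0 or later (earlier versions call it text).

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened document for this sketch.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><a id='link1'>Elsie</a><a id='link2'>Lacie</a></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# By default find_all() looks at all descendants of the tag.
all_titles = soup.html.find_all("title")

# recursive=False only looks at direct children; <title> is a grandchild of <html>.
direct_titles = soup.html.find_all("title", recursive=False)

# string filters on the text of a tag instead of its name or attributes.
by_text = soup.find_all("a", string="Elsie")

# limit stops the search once enough results have been found.
first_only = soup.find_all("a", limit=1)

print(all_titles, direct_titles, by_text, first_only, sep="\n")
```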

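find()

find_all() returns a list of all matching elements, while find() returns a single element.

find(name, attrs, recursive, string, **kwargs)

The find_all() method returns every tag in the document that matches, but sometimes we only want one result. If the document contains only one tag of a given kind, for example, scanning for all of them with find_all() is not a good fit; rather than calling find_all() with limit=1, it is simpler to call find() directly. The following two lines of code are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, while find() returns the result itself. When nothing matches, find_all() returns an empty list and find() returns None.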
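Because find() returns None when nothing matches, code that goes on to read an attribute should check the result first. The helper below is a hypothetical sketch of that pattern, not part of the BeautifulSoup API.

```python
from bs4 import BeautifulSoup


def first_href(soup):
    """Return the href of the first <a> tag, or None if there is no link."""
    link = soup.find("a")
    if link is None:
        # Without this check, link.get(...) would raise AttributeError on None.
        return None
    return link.get("href")


with_link = BeautifulSoup('<p><a href="http://example.com/elsie">Elsie</a></p>', "html.parser")
without_link = BeautifulSoup("<p>No links here</p>", "html.parser")

print(first_href(with_link))
print(first_href(without_link))
print(without_link.find_all("a"))
```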

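CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags with CSS selector syntax. As in CSS, a class name is prefixed with "." and an id is prefixed with "#".

soup.select("title")
# [<title>The Dormouse's story</title>]

Find tags level by level through nested tag names: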

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find the direct children of a tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]
 
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
 
soup.select("body > a")
# []

Find tags by CSS class name:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
 
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Query with several CSS selectors at once, separated by commas:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
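Two further selector features fit naturally here: select_one(), which returns only the first match (or None), and attribute selectors. The sketch below rebuilds the three links from the example so that it runs on its own; it assumes a Beautiful Soup version recent enough to ship select_one().

```python
from bs4 import BeautifulSoup

# The three sister links from the example, repeated so the snippet is self-contained.
html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching tag instead of a list.
first_sister = soup.select_one(".sister")

# Attribute selector: links whose href ends with "tillie".
tillie = soup.select('a[href$="tillie"]')

# Like find(), select_one() returns None when nothing matches.
missing = soup.select_one("#no-such-id")

print(first_sister, tillie, missing, sep="\n")
```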

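Extracting tag content

Suppose we have obtained several tags:

from bs4 import BeautifulSoup

# Parse three sample links so that the loop below has real Tag objects to work on.
links = BeautifulSoup(
    '<a href="http://www.baidu.com/">百度</a>'
    '<a href="http://www.163.com/">网易</a>'
    '<a href="http://www.sina.com/">新浪</a>',
    'html.parser',
).find_all('a')

How do we extract their content? We touched on this at the beginning.

for link in links:
    print(link.get_text())   # get_text() returns the text content of the tag
    print(link.get('href'))  # get('attr') returns the value of an attribute
    print(link['href'])      # the subscript shorthand gives the same result

Result:

百度
http://www.baidu.com/
http://www.baidu.com/
网易
http://www.163.com/
http://www.163.com/
新浪
http://www.sina.com/
http://www.sina.com/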
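In a real crawler the extracted values are usually collected rather than printed. The following sketch gathers the text and href of each link into a list of dictionaries; the record layout (the "text" and "href" keys) is an assumption made for this example.

```python
from bs4 import BeautifulSoup

# The same three sample links, repeated so the snippet is self-contained.
html = (
    '<a href="http://www.baidu.com/">百度</a>'
    '<a href="http://www.163.com/">网易</a>'
    '<a href="http://www.sina.com/">新浪</a>'
)
soup = BeautifulSoup(html, "html.parser")

# One dictionary per link; strip=True trims surrounding whitespace from the text.
records = [
    {"text": a.get_text(strip=True), "href": a.get("href")}
    for a in soup.find_all("a")
]

for record in records:
    print(record["text"], record["href"])
```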

Summary

For BeautifulSoup's parser, lxml is recommended; if garbled characters appear, you can switch to html.parser. The basic tag selection is fast but weak at filtering. To search for tags, find() and find_all() are recommended; if you are familiar with CSS selectors, the .select() method is a good choice. Use get_text() to obtain a tag's text content, and get('attr') or tag['attr'] to obtain an attribute value.

Statement

This article is reproduced from CSDN. In case of any infringement, please contact admin@php.cn to have it removed.
