Introduction to how the Beautiful Soup module creates objects in Python

Y2J

Apr 22, 2017 am 09:45 AM

This article explains how to create objects with the Beautiful Soup module in Python: the BeautifulSoup object itself, and the Tag and NavigableString objects obtained from it.

Installation

Install the Beautiful Soup module with pip: pip install beautifulsoup4

If you write code in the PyCharm IDE, you can also install it from there: open Preferences, find Project, search for the Beautiful Soup module, and install it.
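
To confirm that the installation worked, a quick check like the following can help. It is a minimal sketch that assumes Python 3 and uses only the html.parser parser bundled with Python, so no extra parser library is needed; the variable name installed_ok is chosen here for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed Beautiful Soup version.
print(bs4.__version__)

# Parse a tiny fragment with the parser that ships with Python.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
installed_ok = soup.p.string == "ok"
print(installed_ok)
```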

Create a BeautifulSoup object

The Beautiful Soup module is widely used to get data from web pages. We can use the Beautiful Soup module to extract any data from an HTML/XML document, for example, all links in a web page or content within tags.

To achieve this, Beautiful Soup provides different objects and methods. Any HTML/XML document can be converted into different Beautiful Soup objects. These objects have different properties and methods, and we can extract the required data from them.

This article covers three Beautiful Soup objects:

BeautifulSoup
Tag
NavigableString
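
As a rough illustration of how these three objects relate, the sketch below parses one small fragment and prints the type of each object it yields. It assumes Python 3 and the built-in html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# The BeautifulSoup object represents the whole parsed document.
soup = BeautifulSoup("<p>Hello World</p>", "html.parser")

# A Tag object corresponds to one tag in the document.
tag = soup.p

# A NavigableString object holds the text inside a tag.
text = tag.string

print(type(soup).__name__, type(tag).__name__, type(text).__name__)
```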

Create a BeautifulSoup object

Creating a BeautifulSoup object is the starting point for any Beautiful Soup project.

The BeautifulSoup constructor accepts a string or a file-like object, such as a file on the local machine or a web page response.
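
The two input kinds can be compared side by side. The sketch below is only an illustration: it uses io.StringIO as a stand-in for a real file or network response, and assumes Python 3 with the built-in html.parser.

```python
import io

from bs4 import BeautifulSoup

markup = "<p>Hello World</p>"

# Input given directly as a string.
from_string = BeautifulSoup(markup, "html.parser")

# Input given as a file-like object (anything with a read() method).
from_file_like = BeautifulSoup(io.StringIO(markup), "html.parser")

# Both inputs should produce the same tree.
same_tree = str(from_string) == str(from_file_like)
print(same_tree)
```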

Creating BeautifulSoup objects from strings

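Pass a string to the BeautifulSoup constructor to create the object.

from bs4 import BeautifulSoup

helloworld = '<p>Hello World</p>'
soup_string = BeautifulSoup(helloworld)  # no parser given, so the best installed one is used
print(soup_string)
# Output when lxml is installed (html.parser would print the fragment unchanged):
# <html><body><p>Hello World</p></body></html>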

Creating BeautifulSoup objects through file-like objects

Pass a file-like object to the BeautifulSoup constructor to create the object. This is useful when parsing online web pages.

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.glumes.com"
page = urlopen(url)  # the response is a file-like object
soup = BeautifulSoup(page)
print(soup)

We can also pass a local file object to the BeautifulSoup constructor.

with open('foo.html', 'r') as foo_file:
    soup_foo = BeautifulSoup(foo_file)
print(soup_foo)

Creating BeautifulSoup objects for XML parsing

The Beautiful Soup module can also be used to parse XML.

When creating a BeautifulSoup object, the Beautiful Soup module selects an appropriate TreeBuilder class to build the HTML/XML tree. By default it selects an HTML TreeBuilder, which uses an HTML parser to produce an HTML tree. In the code above, the BeautifulSoup object was generated by parsing the string into an HTML tree.

To have Beautiful Soup parse the input as XML, we need to specify the features parameter of the BeautifulSoup constructor explicitly. Beautiful Soup then selects the TreeBuilder class that best matches the requested features.
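
To see which TreeBuilder was actually selected for a given object, we can inspect the object after it is created. This is a small sketch, assuming Python 3 and the built-in html.parser; it reads the builder and is_xml attributes of the BeautifulSoup object.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello World</p>", "html.parser")

# The TreeBuilder instance that produced this tree.
builder_name = type(soup.builder).__name__
print(builder_name)

# False for an HTML tree, True when an XML builder was used.
print(soup.is_xml)
```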

Understanding the features parameter

Each TreeBuilder has different features depending on the parser it uses, so the same input can produce different results depending on the features parameter passed to the constructor.
In the Beautiful Soup module, the TreeBuilder classes currently use the following parsers:

lxml
html5lib
html.parser

The features parameter of the BeautifulSoup constructor accepts either a single string or a list of strings.
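
Both forms can be tried on the same markup. The following is a sketch under the assumption that only the built-in html.parser is needed; the feature names come from the parser list above. Recent Beautiful Soup versions may print a warning for the list form, because the parser is then not named by a single string.

```python
from bs4 import BeautifulSoup

markup = "<p>Hello World</p>"

# features as a single string
soup_from_string = BeautifulSoup(markup, features="html.parser")

# features as a list of strings; the chosen builder must offer all of them
soup_from_list = BeautifulSoup(markup, features=["html.parser", "html"])

same_builder = type(soup_from_string.builder) is type(soup_from_list.builder)
print(same_builder)
```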

Currently, the features parameters and parsers supported by each TreeBuilder are as shown in the following table:

Features                                               TreeBuilder            Parser
['lxml', 'html', 'fast', 'permissive']                 LXMLTreeBuilder        lxml
['html', 'html5lib', 'permissive', 'strict', 'html5']  HTML5TreeBuilder       html5lib
['html', 'strict', 'html.parser']                      HTMLParserTreeBuilder  html.parser
['xml', 'lxml', 'permissive', 'fast']                  LXMLTreeBuilderForXML  lxml

Based on the specified features parameter, Beautiful Soup selects the most suitable TreeBuilder class. If an error like the following appears when you specify a parser, you probably need to install that parser.

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib.
Do you need to install a parser library?
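
One way to cope with a missing parser at runtime is to catch the exception and fall back to the parser bundled with Python. The helper make_soup below is a hypothetical name introduced for this sketch, not part of Beautiful Soup; the fallback to html.parser is one possible choice among several.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="html5lib"):
    """Parse with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello World</p>")
print(soup.p.string)
```

Note that the resulting tree can differ between parsers, so a silent fallback is only appropriate when that difference does not matter for the data being extracted.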

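For HTML documents, the TreeBuilder is chosen according to the parser priority shown in the table above: lxml first, then html5lib, and finally html.parser. For example, if we pass the string html as the features parameter, Beautiful Soup selects LXMLTreeBuilder when the lxml parser is available. If lxml is not available, it selects HTML5TreeBuilder, based on the html5lib parser. If that is not available either, it selects HTMLParserTreeBuilder, based on html.parser.

For XML, lxml is the only parser, so LXMLTreeBuilderForXML is always selected.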
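
To check which TreeBuilder this priority order yields on a particular machine, the registry that Beautiful Soup consults can be queried directly. This is a sketch that assumes the builder_registry object in bs4.builder; the result depends on which parser libraries are installed.

```python
from bs4.builder import builder_registry

# The builder class that would be chosen for features="html".
html_builder = builder_registry.lookup("html")
print(html_builder.__name__)

# The builder class for features="xml", or None when lxml is not installed.
xml_builder = builder_registry.lookup("xml")
print(xml_builder.__name__ if xml_builder else "no XML builder installed")
```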

So the code to create a Beautiful Soup object for XML is as follows:

helloworld = '<p>Hello World</p>'
soup_string = BeautifulSoup(helloworld, features="xml")
print(soup_string)

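The output is also an XML document.

When creating a Beautiful Soup object, it is better practice to specify the parser. Different parsers can produce very different results, and the difference is most noticeable when the HTML document is invalid.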
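
The effect of the parser choice on invalid markup can be observed with a short experiment. The sketch below assumes Python 3; the invalid fragment is an arbitrary example, and any parser library that is not installed is recorded as None rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# An <a> tag that is never closed, followed by a stray </p>.
invalid_html = "<a></p>"

results = {}
for parser in ("lxml", "html5lib", "html.parser"):
    try:
        results[parser] = str(BeautifulSoup(invalid_html, parser))
    except FeatureNotFound:
        results[parser] = None  # parser library not installed

for parser, output in results.items():
    print(parser, "->", output)
```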

When we create a BeautifulSoup object, the Tag and NavigableString objects are created as well.

Creating a Tag object

We can get Tag objects, that is, the tags in HTML/XML, from a BeautifulSoup object.

Consider the following code:

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html_atag = """
 <html>
 <body>
 <p>Test html a tag example</p>
 <a href="http://www.glumes.com">Home</a>
 <a href="http://www.glumes.com/index.html">Blog</a>
 </body>
 </html>
 """
soup = BeautifulSoup(html_atag, 'html.parser')
atag = soup.a  # the first <a> tag in the document
print(type(atag))
print(atag)

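The output shows that the type of atag is <class 'bs4.element.Tag'>, and that soup.a is the first <a> tag in the HTML document.
An HTML/XML tag object has a name and attributes. The name is the name of the tag; for example, the name of an <a> tag is a. The attributes are the tag's class, id, style, and so on. A Tag object lets us get both the name and the attributes of an HTML tag.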

The name of a Tag object

Use .name to get the name of a Tag object.

tagname = atag.name
print(tagname)

We can also change the name of a Tag object:

atag.name = 'p'

This changes the first <a> tag in the HTML document above into a <p> tag.
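
The effect of renaming can be verified by printing the document afterwards. This is a self-contained sketch using a one-tag document and the built-in html.parser, both chosen here for brevity.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://www.glumes.com">Home</a>', "html.parser")
tag = soup.a

# Rename the tag in place; attributes and text are kept.
tag.name = "p"

renamed = str(soup)
print(renamed)   # <p href="http://www.glumes.com">Home</p>
print(soup.a)    # None, because no <a> tag is left in the tree
```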

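The attributes of a Tag object

In an HTML page, a tag may have different attributes, such as class, id, and style. A Tag object lets us access the tag's attributes like a dictionary.

# atag is the Tag object obtained from soup.a above
print(atag['href'])

The attributes can also be accessed through .attrs, which prints all of them:

print(atag.attrs)
{'href': 'http://www.glumes.com'}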
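
Because attributes behave like dictionary entries, they can also be read safely, changed, and deleted. The sketch below illustrates this on a made-up tag; the class and id values are assumptions added for the example and do not appear in the document above.

```python
from bs4 import BeautifulSoup

html = '<a class="nav main" href="http://www.glumes.com" id="home">Home</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

print(tag["href"])       # http://www.glumes.com
print(tag.get("style"))  # None: .get() avoids a KeyError for a missing attribute
print(tag["class"])      # ['nav', 'main']: class is multi-valued, so it is a list

tag["id"] = "start"      # change an attribute
del tag["class"]         # remove an attribute
print(tag.attrs)         # {'href': 'http://www.glumes.com', 'id': 'start'}
```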

Creating a NavigableString object

A NavigableString object holds the text content of an HTML or XML tag. It is a Unicode string.

We can get the text content of a tag through .string.

navi = atag.string
print(type(navi))
print(navi.string)
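
One detail worth knowing is that .string only returns a NavigableString when the tag has a single child. The sketch below, which assumes the built-in html.parser and a made-up fragment, shows this along with get_text() as an alternative for nested content.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")

inner = soup.b.string
print(isinstance(inner, NavigableString))  # True
print(isinstance(inner, str))              # True: it behaves like a normal string

# <p> has two children (a string and a <b> tag), so .string is None.
print(soup.p.string)

# get_text() joins all the text below the tag instead.
print(soup.p.get_text())                   # Hello World
```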

Summary

The code is summarized below:

BeautifulSoup

soup = BeautifulSoup(String)

soup = BeautifulSoup(String, features="xml")

Tag

tag = soup.tag

tag.name

tag['attribute']

NavigableString

soup.tag.string
