
Python BeautifulSoup library installation and introduction

2017-03-11 09:49:02

## 1. Preface

In previous articles I showed how to crawl blogs, Wikipedia InfoBoxes, and pictures by analyzing page source code with Python. The articles are:
[Python learning] Simple crawling of Wikipedia programming language infoboxes
[Python learning] Simple web crawler: crawling blog articles, with an introduction to the approach
[Python learning] Simple crawling of pictures from a picture website gallery

The core code is as follows:

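# coding=utf-8
import urllib
import re

# Assumption: the URL and the title pattern are not shown in this excerpt;
# the values below are illustrative and chosen to match the sample output.
url = "http://www.csdn.net/"
title_ex = re.compile(r'(?<=<title>).*?(?=</title>)', re.M | re.S)

content = urllib.urlopen(url).read()

# Extract the page title
title_obj = re.search(title_ex, content)
title = title_obj.group()
print title

# Extract the text of the hyperlinks
href = r'<a href=.*?>(.*?)</a>'
m = re.findall(href, content, re.S | re.M)
for text in m:
    print unicode(text, 'utf-8')
    break  # print only the first link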

The output result is as follows:

CSDN.NET - 全球最大中文IT社区,为IT专业技术人员提供最全面的信息传播和服务平台

The core code for image downloading is as follows:

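import os
import urllib

# Send a browser-like User-Agent instead of the default urllib one
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0"

urllib._urlopener = AppURLopener()

url = "http://creatim.allyes.com.cn/imedia/csdn/20150228/15_41_49_5B9C9E6A.jpg"
filename = os.path.basename(url)   # last path component becomes the local file name
urllib.urlretrieve(url, filename)  # download the image into the current directory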

However, crawling website content by analyzing the HTML with regular expressions has several drawbacks:
1. The regular expressions are tied to the HTML source code rather than to a more abstract structure, so a small change in the structure of the web page may break the program.
2. The program has to analyze the content based on the actual HTML source. It may run into HTML features such as character entities (for example &amp;), and it needs special handling for things like nested tag pairs, icon hyperlinks, and subscripts.
3. Regular expressions are not very readable, and more complex HTML and query expressions quickly become messy.
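The first two drawbacks can be illustrated with a small sketch. It is written for Python 3 (an assumption; the article's own code is Python 2), reuses the link pattern from the code above, and runs it against two made-up HTML fragments (`original` and `redesigned` are hypothetical) that describe the same link.

```python
import html
import re

# The link pattern used earlier in this article.
link_text = re.compile(r'<a href=.*?>(.*?)</a>', re.S | re.M)

# Two hypothetical fragments for the same link; the second only adds an attribute.
original = '<a href="/post/1">Tom &amp; Jerry</a>'
redesigned = "<a class='nav' href='/post/1'>Tom &amp; Jerry</a>"

found = link_text.findall(original)
print(found)                          # ['Tom &amp; Jerry'] (entity left undecoded)
print(html.unescape(found[0]))        # Tom & Jerry (needs an explicit extra step)
print(link_text.findall(redesigned))  # [] (the extra attribute breaks the match)
```

The pattern silently returns nothing once an attribute appears before `href`, and the character entity has to be decoded by hand.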

To address these problems, Basic Tutorial (2nd Edition) offers two solutions: the first uses Tidy (a Python library) together with XHTML parsing; the second uses the BeautifulSoup library.

## 2. Installing and introducing the Beautiful Soup library

Beautiful Soup is an HTML/XML parser written in Python. It handles irregular markup well and builds a parse tree from it, and it provides simple, commonly used operations for navigating, searching, and modifying that tree. It can save a great deal of programming time.
As the book says: "You didn't write those awful web pages; you are just trying to get some data out of them. Now you don't need to care what the HTML looks like; the parser takes care of it for you."

Download address: http://www.php.cn/
To install, run the following command from the unpacked source directory:
python setup.py install
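One way to confirm that the installation worked is to ask Python whether the package can be found. This is a minimal sketch for Python 3, assuming the package is importable under the name `bs4` (the name used by the import statements later in this article); `is_importable` is a hypothetical helper, not part of Beautiful Soup.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once Beautiful Soup 4 is installed for this interpreter.
print("bs4 importable:", is_importable("bs4"))
```

Because `find_spec` only looks the module up without importing it, the check is safe to run before anything else depends on the library.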

For detailed usage, the Chinese documentation is recommended: http://www.php.cn/ The basic usage of BeautifulSoup is outlined below, using the official "Alice in Wonderland" example:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse the document and print it in indented form
soup = BeautifulSoup(html_doc)
print soup.prettify()

The output, printed in standard indented format, is as follows (abbreviated):

   The Dormouse's story
  <p class="title">
    The Dormouse's story
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
   <a class="sister" href="http://example.com/lacie" id="link2">
   <a class="sister" href="http://example.com/tillie" id="link3">
and they lived at the bottom of a well.
  <p class="story">

The following is a quick introduction to the BeautifulSoup library (reference: official documentation):

print soup.title
# <title>The Dormouse's story</title>
print soup.title.name
# title
print unicode(soup.title.string)
# The Dormouse's story

print soup.p
# <p class="title"><b>The Dormouse's story</b></p>
print soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
for link in soup.find_all('a'):
    print link.get('href')
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
print soup.find(id='link3')
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

If you want to get all the text content in the article, the code is as follows:

print soup.get_text()
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
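For readers on Python 3, the same lookups can be sketched as follows. This is an illustrative adaptation rather than the article's own code: it assumes Beautiful Soup 4 is installed, passes the standard library's "html.parser" explicitly, and uses a shortened copy of the example document.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

# Naming the parser explicitly avoids depending on which parsers are installed.
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.title.string
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
third = soup.find(id="link3")

print(title)
for text, href in links:
    print(text, href)
print(third["href"])
```

Collecting `(text, href)` pairs in one pass keeps the link text and its target together, which is usually what a crawler needs next.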

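Two errors you may encounter:

1. ImportError: No module named BeautifulSoup
After successfully installing the BeautifulSoup 4 library, "from BeautifulSoup import BeautifulSoup" may raise this error.

The reason is that the BeautifulSoup 4 package was renamed to bs4, so it must be imported with "from bs4 import BeautifulSoup".
2. TypeError: an integer is required
When you use "print soup.title.string" to get the value of the title, you may hit this error. Convert the value explicitly, as follows:

print unicode(soup.title.string)
print str(soup.title.string)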

## 3. Common Beautiful Soup methods

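Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

1. Tag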

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<p class="title" id="start"><b>The Dormouse's story</b></p>
"""
soup = BeautifulSoup(html)
tag = soup.p
print tag
# <p class="title" id="start"><b>The Dormouse's story</b></p>
print type(tag)
# <class 'bs4.element.Tag'>
print tag.name
# p (the tag name)
print tag['class']
# [u'title']
print tag.attrs
# {u'class': [u'title'], u'id': u'start'}

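2. NavigableString
Strings are usually contained inside tags, and Beautiful Soup wraps the strings in a tag with the NavigableString class. A NavigableString is the same as a Python Unicode string, and it also supports some of the features described under navigating the tree and searching the tree. The unicode() function converts a NavigableString object directly into a Unicode string.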

print unicode(tag.string)
# The Dormouse's story
print type(tag.string)
# <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
print tag
# <p class="title" id="start"><b>No longer bold</b></p>

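This reads the string inside tag = soup.p, that is, inside "<p class="title" id="start"><b>The Dormouse's story</b></p>". The string contained in a tag cannot be edited in place, but it can be replaced with the replace_with() function.
NavigableString objects support most, but not all, of the attributes defined under navigating the tree and searching the tree. In particular, a string cannot contain anything else (a tag can contain strings or other tags), so strings do not support the .contents or .string attributes or the find() method.
If you want to use a NavigableString outside Beautiful Soup, call unicode() on it to turn it into an ordinary Unicode string. Otherwise the string keeps a reference to the whole Beautiful Soup parse tree even after you have finished with Beautiful Soup, which wastes memory.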

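3. BeautifulSoup object
Note: because the BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes. It is sometimes useful to look at its .name, however, so the BeautifulSoup object is given the special .name value "[document]", available as soup.name.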
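4. Comment
The other types defined in Beautiful Soup may appear in XML documents: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment objects, these classes are all subclasses of NavigableString; they are string objects with a few extra methods added.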

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
print type(comment)
# <class 'bs4.element.Comment'>
print unicode(comment)
# Hey, buddy. Want to buy a used parser?
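The four kinds of objects described in this section can be told apart with ordinary type checks. The following is a minimal Python 3 sketch, assuming Beautiful Soup 4 with the built-in "html.parser"; the markup string is a made-up example. In Python 3, str() plays the role that unicode() plays in the Python 2 code above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<p class='title'><b>Hello</b><!--a note--></p>", "html.parser")
p = soup.p
text_node = p.b.string
comment_node = p.contents[1]

# One object of each kind: BeautifulSoup, Tag, NavigableString, Comment.
kinds = [type(obj).__name__ for obj in (soup, p, text_node, comment_node)]
print(kinds)

# Comment is a subclass of NavigableString, so exclude it explicitly
# when only the visible text is wanted.
visible = [
    str(node)
    for node in p.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
print(visible)

# Converting to a plain str detaches the text from the parse tree.
plain = str(text_node)
print(type(plain) is str)
```

Checking for Comment before NavigableString matters because a comment would otherwise be treated as ordinary page text.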


soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Child nodes: when analyzing HTML you usually need to look at a tag's children. A tag's .contents attribute returns its children as a list. Strings have no .contents attribute, because a string has no children.

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u"The Dormouse's story"]

With a tag's .children generator you can loop over its children:

for child in title_tag.children:
    print child
    # The Dormouse's story

Descendant nodes: similarly, the .descendants attribute iterates recursively over all of a tag's descendants:

for child in head_tag.descendants:
    print child
    # <title>The Dormouse's story</title>
    # The Dormouse's story
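The difference between direct children and all descendants can be checked by counting them. This is a small Python 3 sketch under the same assumptions as before (Beautiful Soup 4, the built-in "html.parser", and a cut-down copy of the example document).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
head_tag = soup.head

direct = list(head_tag.children)         # only the <title> tag
everything = list(head_tag.descendants)  # the <title> tag, then its string

print(len(direct), len(everything))
print(direct[0].name)
print(everything[1])
```

The <head> tag has a single child, but its descendants also include the string nested inside <title>.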

Parent node: use the .parent attribute to get an element's parent. In the "Alice" example document, the <head> tag is the parent of the <title> tag; in other words, it sits one level above it.
Note: the parent of a top-level node such as <html> is the BeautifulSoup object, and the .parent of the BeautifulSoup object is None.

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
title_tag.string.parent
# <title>The Dormouse's story</title>


sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

In the document tree, use the .next_sibling and .previous_sibling attributes to move between sibling nodes. The <b> tag has a .next_sibling but no .previous_sibling, because it is the first of its siblings. Likewise, the <c> tag has a .previous_sibling but no .next_sibling:

sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>
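In real pages, tags are usually separated by line breaks, so the nearest sibling is often a whitespace string rather than a tag. The sketch below (Python 3, Beautiful Soup 4 with the built-in "html.parser", made-up markup) shows that case next to the compact one, and also walks up the tree with .parents.

```python
from bs4 import BeautifulSoup

# Compact markup: the siblings of <b> are tags only.
compact = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
print(compact.b.next_sibling.name)
print(compact.b.previous_sibling)
print([parent.name for parent in compact.b.parents])

# Markup with line breaks: the nearest sibling is the newline string.
spaced = BeautifulSoup("<a>\n<b>text1</b>\n<c>text2</c>\n</a>", "html.parser")
print(repr(spaced.b.next_sibling))

# find_next_sibling() skips the whitespace and returns the next matching tag.
print(spaced.b.find_next_sibling("c").name)
```

With "html.parser" no <html> or <body> wrapper is added, so the parent chain of <b> here is just <a> followed by the BeautifulSoup object, whose name is "[document]".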

(By: Eastmount, 2015-3-25, 6 p.m. http://www.php.cn/)
