
Building a web crawler in Python with BeautifulSoup

PHP中文网, original article, 2017-06-01 10:20:14

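An earlier article (www.jb51.net/article/55789.htm) covered scraping web pages with phantomjs, working together with selectors.

With the Python module BeautifulSoup (documentation: www.crummy.com/software/BeautifulSoup/bs4/doc/), scraping page content becomes very easy.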

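# coding=utf-8
# Python 3: urlencode and urlopen live in urllib.parse and urllib.request
from urllib.parse import urlencode
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s'
values = {'wd': '网球'}  # search query ("tennis")
encoded_param = urlencode(values)  # percent-encodes the query string
full_url = url + '?' + encoded_param
response = urlopen(full_url)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response, 'html.parser')
alinks = soup.find_all('a')  # every <a> tag on the page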

The code above requests the Baidu results page for the query 网球 ("tennis") and collects every link on it.
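As a rough illustration of what can be done with the collected links, the sketch below pulls the text and href out of each a tag. It parses a small made-up HTML string (SAMPLE_HTML, which is not real Baidu markup) so that it runs offline, and extract_links is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded results page (not real Baidu markup).
SAMPLE_HTML = """
<div>
  <a href="http://example.com/tennis">Tennis news</a>
  <a href="/rules">Rules</a>
  <a name="top">No href here</a>
</div>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a"):
        href = a.get("href")  # None when the attribute is missing
        if href:
            links.append((a.get_text(strip=True), href))
    return links


links = extract_links(SAMPLE_HTML)
for text, href in links:
    print(text, href)
```

Using .get("href") rather than a["href"] avoids a KeyError on anchors that carry no href attribute.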

BeautifulSoup ships with many very useful built-in methods.
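To make "useful methods" a little more concrete, here is a minimal sketch of three common lookups: find, find_all with a class filter, and select with a CSS selector. The HTML snippet and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup, used only to demonstrate the search methods.
html = """
<ul id="players">
  <li class="player top">Alice</li>
  <li class="player">Bob</li>
  <li class="coach">Carol</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")                    # first match, or None
players = soup.find_all("li", class_="player")  # matches any tag whose classes include "player"
coaches = soup.select("ul#players li.coach")    # CSS selector lookup
limited = soup.find_all("li", limit=2)          # stop after two matches

print([li.get_text() for li in players])
```

The keyword is spelled class_ because class is a reserved word in Python.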

A few of the handier features:

Building a node (Tag) element

Code:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

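Attributes are available through .attrs, which returns a dictionary

Code:

tag.attrs
# {'class': ['boldest']}

A single attribute can also be read by subscript, as in tag['class']. (tag.class does not work, because class is a reserved word in Python.)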
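The different ways of reading attributes can be sketched as follows. The id attribute and the variable names are additions for illustration and do not appear in the example above.

```python
from bs4 import BeautifulSoup

# The id attribute is added here only to show a single-valued attribute.
soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

# class is multi-valued in HTML, so it comes back as a list.
classes = tag["class"]
# Single-valued attributes come back as plain strings.
tag_id = tag["id"]
# .get() returns a default instead of raising KeyError.
title = tag.get("title", "no title")
# has_attr() checks for presence without reading the value.
has_id = tag.has_attr("id")

print(classes, tag_id, title, has_id)
```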

Attributes can also be changed freely

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
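Building on the attribute operations above, a minimal sketch of bulk cleanup: it deletes every attribute except an allow-list from every tag. strip_attributes and its keep parameter are hypothetical names, and the input markup is made up.

```python
from bs4 import BeautifulSoup


def strip_attributes(html, keep=("href",)):
    """Remove every attribute not listed in `keep` from every tag."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):   # True matches every tag
        for name in list(tag.attrs):  # copy the keys: the dict is mutated below
            if name not in keep:
                del tag[name]
    return str(soup)


cleaned = strip_attributes(
    '<p class="x" id="y"><a href="/a" style="color:red">link</a></p>'
)
print(cleaned)
```

Iterating over list(tag.attrs) rather than tag.attrs itself matters here, since deleting keys from a dictionary while iterating over it raises a RuntimeError.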

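You can also navigate and search DOM elements freely, as in the example below

1. Build a document

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')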

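2. Try things out

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
len(soup.contents)
# 1
soup.contents[0].name
# 'html'
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'

for child in title_tag.children:
    print(child)
# The Dormouse's story
head_tag.contents
# [<title>The Dormouse's story</title>]
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
title_tag.string
# "The Dormouse's story"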
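The difference between .children and .descendants, and between .string and get_text(), can be sketched on a smaller document. The markup below is a shortened, made-up variant of the example, and the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, made-up variant of the Dormouse document.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>Once <b>upon</b> a time</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# .children yields direct children only; .descendants walks the whole subtree.
head_children = [child.name for child in soup.head.children]
head_descendants = list(soup.head.descendants)

p = soup.p
p_string = p.string                   # None: <p> has more than one child
p_text = p.get_text()                 # all nested text joined together
p_strings = list(p.stripped_strings)  # each text piece, whitespace trimmed

print(head_children, len(head_descendants), p_string, p_text, p_strings)
```

When a tag may contain nested markup, get_text() is usually the safer choice, because .string only returns a value when the tag has exactly one child.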
