python利用beautifulSoup实现爬虫-Python Tutorial-php.cn

Home

Backend Development

Python Tutorial

python利用beautifulSoup实现爬虫

PHP中文网

Jun 01, 2017 am 10:20 AM

beautifulsoupreptile

以前讲过利用phantomjs做爬虫抓网页 www.jb51.net/article/55789.htm 是配合选择器做的

利用 beautifulSoup(文档：www.crummy.com/software/BeautifulSoup/bs4/doc/)这个python模块，可以很轻松的抓取网页内容

# coding=utf-8
import urllib
from bs4 import BeautifulSoup

url =&#39;http://www.baidu.com/s&#39;
values ={&#39;wd&#39;:&#39;网球&#39;}
encoded_param = urllib.urlencode(values)
full_url = url +&#39;?&#39;+ encoded_param
response = urllib.urlopen(full_url)
soup =BeautifulSoup(response)
alinks = soup.find_all(&#39;a&#39;)

上面可以抓取百度搜出来结果是网球的记录。

beautifulSoup内置了很多非常有用的方法。

几个比较好用的特性：

构造一个node元素

代码如下:

soup = BeautifulSoup(&#39;
Extremely bold
&#39;)
tag = soup.b
type(tag)
#

属性可以使用attr拿到，结果是字典

代码如下:

tag.attrs
# {u&#39;class&#39;: u&#39;boldest&#39;}

或者直接tag.class取属性也可。

也可以自由操作属性

tag[&#39;class&#39;] = &#39;verybold&#39;
tag[&#39;id&#39;] = 1
tag
#Extremely bolddel tag[&#39;class&#39;]
del tag[&#39;id&#39;]
tag
#Extremely boldtag[&#39;class&#39;]
# KeyError: &#39;class&#39;
print(tag.get(&#39;class&#39;))
# None

还可以随便操作，查找dom元素，比如下面的例子

1.构建一份文档

html_doc = """The Dormouse&#39;s storyThe Dormouse&#39;s storyOnce upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;
and they lived at the bottom of a well...."""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

2.各种搞

soup.head
#The Dormouse&#39;s storysoup.title
#The Dormouse&#39;s storysoup.body.b
# The Dormouse&#39;s storysoup.a
# Elsiesoup.find_all(&#39;a&#39;)
# [Elsie,
# Lacie,
# Tillie]
head_tag = soup.head
head_tag
#The Dormouse&#39;s storyhead_tag.contents
[The Dormouse&#39;s story]

title_tag = head_tag.contents[0]
title_tag
#The Dormouse&#39;s storytitle_tag.contents
# [u&#39;The Dormouse&#39;s story&#39;]
len(soup.contents)
# 1
soup.contents[0].name
# u&#39;html&#39;
text = title_tag.contents[0]
text.contents

for child in title_tag.children:
  print(child)
head_tag.contents
# [The Dormouse&#39;s story]
for child in head_tag.descendants:
  print(child)
#The Dormouse&#39;s story# The Dormouse&#39;s story

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
title_tag.string
# u&#39;The Dormouse&#39;s story&#39;

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Python's Hybrid Approach: Compilation and Interpretation CombinedMay 08, 2025 am 12:16 AM

Pythonusesahybridapproach,combiningcompilationtobytecodeandinterpretation.1)Codeiscompiledtoplatform-independentbytecode.2)BytecodeisinterpretedbythePythonVirtualMachine,enhancingefficiencyandportability.

Learn the Differences Between Python's 'for' and 'while' LoopsMay 08, 2025 am 12:11 AM

ThekeydifferencesbetweenPython's"for"and"while"loopsare:1)"For"loopsareidealforiteratingoversequencesorknowniterations,while2)"while"loopsarebetterforcontinuinguntilaconditionismetwithoutpredefinediterations.Un

Python concatenate lists with duplicatesMay 08, 2025 am 12:09 AM

In Python, you can connect lists and manage duplicate elements through a variety of methods: 1) Use operators or extend() to retain all duplicate elements; 2) Convert to sets and then return to lists to remove all duplicate elements, but the original order will be lost; 3) Use loops or list comprehensions to combine sets to remove duplicate elements and maintain the original order.

Python List Concatenation Performance: Speed ComparisonMay 08, 2025 am 12:09 AM

ThefastestmethodforlistconcatenationinPythondependsonlistsize:1)Forsmalllists,the operatorisefficient.2)Forlargerlists,list.extend()orlistcomprehensionisfaster,withextend()beingmorememory-efficientbymodifyinglistsin-place.

How do you insert elements into a Python list?May 08, 2025 am 12:07 AM

ToinsertelementsintoaPythonlist,useappend()toaddtotheend,insert()foraspecificposition,andextend()formultipleelements.1)Useappend()foraddingsingleitemstotheend.2)Useinsert()toaddataspecificindex,thoughit'sslowerforlargelists.3)Useextend()toaddmultiple

Are Python lists dynamic arrays or linked lists under the hood?May 07, 2025 am 12:16 AM

Pythonlistsareimplementedasdynamicarrays,notlinkedlists.1)Theyarestoredincontiguousmemoryblocks,whichmayrequirereallocationwhenappendingitems,impactingperformance.2)Linkedlistswouldofferefficientinsertions/deletionsbutslowerindexedaccess,leadingPytho

How do you remove elements from a Python list?May 07, 2025 am 12:15 AM

Pythonoffersfourmainmethodstoremoveelementsfromalist:1)remove(value)removesthefirstoccurrenceofavalue,2)pop(index)removesandreturnsanelementataspecifiedindex,3)delstatementremoveselementsbyindexorslice,and4)clear()removesallitemsfromthelist.Eachmetho

What should you check if you get a 'Permission denied' error when trying to run a script?May 07, 2025 am 12:12 AM

Toresolvea"Permissiondenied"errorwhenrunningascript,followthesesteps:1)Checkandadjustthescript'spermissionsusingchmod xmyscript.shtomakeitexecutable.2)Ensurethescriptislocatedinadirectorywhereyouhavewritepermissions,suchasyourhomedirectory.

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

How to fix KB5055523 fails to install in Windows 11?

3 weeks agoByDDD

How to fix KB5055518 fails to install in Windows 10?

3 weeks agoByDDD

Roblox: Grow A Garden - Complete Mutation Guide

2 weeks agoByDDD

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

How to fix KB5055612 fails to install in Windows 10?

3 weeks agoByDDD

Hot Tools

MinGW - Minimalist GNU for Windows

This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.