
Python Web Scraping -- Using BeautifulSoup

巴扎黑 · Original · 2017-07-17 15:53:31

Python Web Scraping, Part 1: Using BeautifulSoup

From the book: Ryan Mitchell, 《Python网络数据采集》 (Web Scraping with Python). The examples are copied straight from the book; I still found it worthwhile to type them out myself, so I'm recording them here.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.h1)

<h1>An Interesting Title</h1>

Accessing the page with urllib looks a bit different: read() returns bytes, which must be decoded into UTF-8 text, like a.read().decode('utf-8'). However, when parsing with bs4, you can pass the response object returned by urllib directly.

import urllib.request
from bs4 import BeautifulSoup

a = urllib.request.urlopen('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(a, 'lxml')
print(soup.h1)

<h1>An Interesting Title</h1>
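The decode step mentioned above can also be tried without a network call. A minimal sketch (the bytes literal below is a stand-in for what read() would return, and the builtin html.parser is my own choice here, not from the book):

```python
from bs4 import BeautifulSoup

# Stand-in for the bytes that urllib's read() would return
raw = b'<html><body><h1>An Interesting Title</h1></body></html>'

# Decode bytes to text first, as with a.read().decode('utf-8')
text = raw.decode('utf-8')
soup = BeautifulSoup(text, 'html.parser')
print(soup.h1.string)  # An Interesting Title
```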

Scrape all span tags whose CSS class attribute is green -- these are character names.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/warandpeace.html')

soup = BeautifulSoup(res.text, 'lxml')
green_names = soup.find_all('span', class_='green')
for name in green_names:
    print(name.string)


Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
...
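The class_='green' keyword is one of two equivalent spellings; attrs={'class': 'green'} does the same thing. A self-contained sketch on a stand-in snippet in the style of the warandpeace page (the snippet and html.parser are my own choices):

```python
from bs4 import BeautifulSoup

# A small stand-in for the warandpeace page
html = '''
<span class="green">Anna Pavlovna</span>
<span class="red">Well, Prince...</span>
<span class="green">the prince</span>
'''
soup = BeautifulSoup(html, 'html.parser')

# attrs={'class': 'green'} is equivalent to class_='green'
names = soup.find_all('span', attrs={'class': 'green'})
print([n.string for n in names])  # ['Anna Pavlovna', 'the prince']
```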

Children and descendants are not the same thing. A child tag is a direct child of its parent, while descendants include everything nested under the parent at any depth. Put simply, descendants include children.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
gifts = soup.find('table', id='giftList').children
for name in gifts:
    print(name)


<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg">
</td></tr>

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg">
</td></tr>
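The child/descendant distinction can be seen on a tiny stand-in table without touching the network (the snippet and html.parser are my own choices, not from the book):

```python
from bs4 import BeautifulSoup

# A tiny table in the shape of the giftList example
html = '<table id="giftList"><tr><td><span>Vegetable Basket</span></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='giftList')

# children: only the direct child (the tr)
children = [c.name for c in table.children]
# descendants: tr, td, span, plus the text node, at every depth
descendants = list(table.descendants)

print(children)          # ['tr']
print(len(descendants))  # 4
```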

After finding the table, select its first tr as the current node and take that tr's following siblings. Since the first tr is the table header, this extracts every body row while skipping the header.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
gifts = soup.find('table', id='giftList').tr.next_siblings
for name in gifts:
    print(name)


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg">
</td></tr>

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg">
</td></tr>

To look up a product's price, you can find the product's image, move up to its parent td tag, and take that td's previous sibling, which is the price cell.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
price = soup.find('img', src='../img/gifts/img1.jpg').parent.previous_sibling.string
print(price)


$15.00
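One caveat worth noting (my own observation, not from the book): when whitespace appears between tags in the source HTML, previous_sibling can return a newline text node rather than the td you want. find_previous_sibling('td') skips text nodes and is more robust. A sketch on a stand-in snippet using html.parser:

```python
from bs4 import BeautifulSoup

# The newlines between the td tags become text-node siblings
html = '''<table><tr>
<td>$15.00</td>
<td><img src="../img/gifts/img1.jpg"></td>
</tr></table>'''
soup = BeautifulSoup(html, 'html.parser')

img = soup.find('img', src='../img/gifts/img1.jpg')
# previous_sibling here is the '\n' text node, not the price cell
print(repr(img.parent.previous_sibling))  # '\n'
# find_previous_sibling skips text nodes and returns the td tag
print(img.parent.find_previous_sibling('td').string)  # $15.00
```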

Collect all the product images. To keep unrelated images from slipping in, use a regular expression for a precise match.

import re
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# Escape the literal dots so '.' does not match arbitrary characters
imgs = soup.find_all('img', src=re.compile(r'\.\./img/gifts/img.*\.jpg'))
for img in imgs:
    print(img['src'])


../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
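The regex filter can be exercised offline as well. A sketch on a stand-in snippet (the logo filename and html.parser are my own assumptions); BeautifulSoup applies the compiled pattern to each tag's src attribute value:

```python
import re
from bs4 import BeautifulSoup

# Stand-in snippet: gift images plus a logo that should not match
html = '''
<img src="../img/gifts/logo.jpg">
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img2.jpg">
'''
soup = BeautifulSoup(html, 'html.parser')

# \d+ pins the match to numbered gift images, excluding the logo
imgs = soup.find_all('img', src=re.compile(r'\.\./img/gifts/img\d+\.jpg'))
print([img['src'] for img in imgs])  # ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
```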

find_all() can also be given a function. There is one requirement: it must return a boolean. Tags for which it returns True are kept; tags for which it returns False are discarded.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# equivalent here: soup.find_all(lambda tag: tag.name == 'img')
tags = soup.find_all(lambda tag: tag.has_attr('src'))
for tag in tags:
    print(tag)


<img src="../img/gifts/logo.jpg">
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img2.jpg">
<img src="../img/gifts/img3.jpg">
<img src="../img/gifts/img4.jpg">
<img src="../img/gifts/img6.jpg">

Here tag is a Tag object; has_attr checks whether the tag has the given attribute, and tag.name returns the tag's name. On the page above, the following two filters return the same results.

lambda tag: tag.has_attr('src')
lambda tag: tag.name == 'img'
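A function filter can test any property of the tag, not just its name or a single attribute. For example, keeping only tags that carry exactly two attributes -- a self-contained sketch on a stand-in snippet (the snippet and html.parser are my own choices):

```python
from bs4 import BeautifulSoup

html = '''
<div id="a" class="x">two attrs</div>
<div id="b">one attr</div>
<span id="c" class="y">two attrs</span>
'''
soup = BeautifulSoup(html, 'html.parser')

# Keep only tags with exactly two attributes, regardless of tag name
tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
print([tag['id'] for tag in tags])  # ['a', 'c']
```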


