
Python data collection--Usage of Beautifulsoup


Python Web Scraping 1: Using BeautifulSoup

From the book Web Scraping with Python by Ryan Mitchell. The examples are copied straight from the book; I still found it worthwhile to type them out myself, so I'm recording them here.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.h1)
<h1>An Interesting Title</h1>
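The example above assumes the request succeeds and the <h1> exists. A slightly more defensive sketch (my own addition, not from the book) might look like this:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page1.html')
res.raise_for_status()          # raise on HTTP errors (4xx/5xx)

soup = BeautifulSoup(res.text, 'lxml')
h1 = soup.h1                    # None if the page has no <h1> tag
print(h1.get_text() if h1 is not None else 'no <h1> found')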

Accessing the page with urllib looks like this: read() returns bytes, which have to be decoded into UTF-8 text, e.g. a.read().decode('utf-8'). When parsing with bs4, however, you can pass the response object returned by urllib directly.

import urllib.request
from bs4 import BeautifulSoup

a = urllib.request.urlopen('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(a, 'lxml')
print(soup.h1)
<h1>An Interesting Title</h1>
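For comparison, a minimal sketch of the byte-decoding route mentioned above (my own addition, using the same URL):

import urllib.request
from bs4 import BeautifulSoup

a = urllib.request.urlopen('https://www.pythonscraping.com/pages/page1.html')
html = a.read().decode('utf-8')   # read() returns bytes; decode to text first
soup = BeautifulSoup(html, 'lxml')
print(soup.h1)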

Scrape all the span tags whose CSS class is green; these are the character names.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/warandpeace.html')
soup = BeautifulSoup(res.text, 'lxml')
green_names = soup.find_all('span', class_='green')
for name in green_names:
    print(name.string)


Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
...
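As a small follow-up (my own addition, not from the book), the extracted names can be tallied with collections.Counter to see which characters appear most often:

from collections import Counter

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/warandpeace.html')
soup = BeautifulSoup(res.text, 'lxml')

# normalize internal newlines so "Anna\nPavlovna" and "Anna Pavlovna" count together
names = [' '.join(span.get_text().split())
         for span in soup.find_all('span', class_='green')]
for name, count in Counter(names).most_common(3):
    print(name, count)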

A child and a descendant are not the same thing. A child tag is the direct next generation under a parent tag, whereas descendant tags include everything nested under the parent at any depth. Put simply, descendants include children.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
gifts = soup.find('table', id='giftList').children
for name in gifts:
    print(name)


<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg">
</td></tr>

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg">
</td></tr>
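To make the child/descendant distinction concrete, a quick sketch (my own addition) that counts both for the same table:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
table = soup.find('table', id='giftList')

# children: only the direct tr rows (plus whitespace text nodes between them)
# descendants: every tag and string nested anywhere inside the table
print(len(list(table.children)), len(list(table.descendants)))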

After locating the table, take its first tr as the current node and iterate over that tr's following siblings. Since the first tr is the table header, this extracts all the data rows while skipping the header.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
gifts = soup.find('table', id='giftList').tr.next_siblings
for name in gifts:
    print(name)


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg">
</td></tr>

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg">
</td></tr>
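Note that next_siblings also yields the whitespace text nodes between the rows (the blank lines above). A slightly more selective variant (my own sketch) asks only for tr siblings and pulls out the item titles:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')

# find_next_siblings('tr') skips the stray text nodes entirely
rows = soup.find('table', id='giftList').tr.find_next_siblings('tr')
for row in rows:
    print(row.td.get_text().strip())   # first <td> in each row holds the title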

To find a product's price, locate the product's image, move up to its parent td tag, and read that tag's previous sibling, which holds the price.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
price = soup.find('img', src='../img/gifts/img1.jpg').parent.previous_sibling.string
print(price)


$15.00
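previous_sibling can land on a whitespace text node if there is any text between the <td> tags; a more robust variant (my own sketch) asks explicitly for the previous td:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')

# find_previous_sibling('td') skips any intervening text nodes
img = soup.find('img', src='../img/gifts/img1.jpg')
print(img.parent.find_previous_sibling('td').get_text().strip())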

Collect all the product images. To keep unrelated images from slipping in, use a regular expression for a precise match on the src attribute.

import re
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
imgs = soup.find_all('img', src=re.compile(r'../img/gifts/img.*.jpg'))
for img in imgs:
    print(img['src'])


../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
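The src values are relative paths. If you need absolute URLs (for example, to download the images later), they can be resolved against the page URL; a minimal sketch, my own addition:

import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = 'https://www.pythonscraping.com/pages/page3.html'
res = requests.get(page_url)
soup = BeautifulSoup(res.text, 'lxml')

for img in soup.find_all('img', src=re.compile(r'../img/gifts/img.*.jpg')):
    # urljoin resolves '../img/...' relative to the page URL
    print(urljoin(page_url, img['src']))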

find_all() can also take a function. The one requirement is that the function must return a boolean: tags for which it returns True are kept, and those for which it returns False are discarded.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# equivalent here: lambda tag: tag.name == 'img'
tags = soup.find_all(lambda tag: tag.has_attr('src'))
for tag in tags:
    print(tag)


<img src="../img/gifts/logo.jpg" style="float:left;">
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img2.jpg">
<img src="../img/gifts/img3.jpg">
<img src="../img/gifts/img4.jpg">
<img src="../img/gifts/img6.jpg">

Here tag is a bs4 Tag object; has_attr checks whether the tag has the given attribute, and tag.name returns the tag's name. On the page above, the two expressions below return the same results.

lambda tag: tag.has_attr('src')
lambda tag: tag.name == 'img'
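The same idea works with a named function once the condition grows beyond a one-liner; for example (my own sketch, not from the book), keeping only the gift rows of the table:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')

def is_gift_row(tag):
    # True only for <tr class="gift"> elements; find_all keeps those
    return tag.name == 'tr' and 'gift' in tag.get('class', [])

for row in soup.find_all(is_gift_row):
    print(row.td.get_text().strip())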


