来自此书: [美]Ryan Mitchell 《Python网络数据采集》
,例子是照搬的,觉得跟着敲一遍还是有作用的,所以记录下来。
import requestsfrom bs4 import BeautifulSoup res = requests.get('https://www.pythonscraping.com/pages/page1.html') soup = BeautifulSoup(res.text, 'lxml')print(soup.h1)
4a249f0d628e2318394fd9b75b4636b1An Interesting Title473f0a7621bec819994bb5020d29372a
使用urllib访问页面是这样的,read返回的是字节,需要解码为utf-8的文本。像这样a.read().decode('utf-8')
,不过在使用bs4解析时候,可以直接传入urllib库返回的响应对象。
import urllib.request a = urllib.request.urlopen('https://www.pythonscraping.com/pages/page1.html') soup = BeautifulSoup(a, 'lxml')print(soup.h1)
4a249f0d628e2318394fd9b75b4636b1An Interesting Title473f0a7621bec819994bb5020d29372a
抓取所有CSS class属性为green的span标签,这些是人名。
import requestsfrom bs4 import BeautifulSoup res = requests.get('https://www.pythonscraping.com/pages/warandpeace.html') soup = BeautifulSoup(res.text, 'lxml') green_names = soup.find_all('span', class_='green')for name in green_names:print(name.string)
Anna Pavlovna Scherer Empress Marya Fedorovna Prince Vasili Kuragin Anna Pavlovna St. Petersburg the prince Anna Pavlovna Anna Pavlovna ...
孩子(child)和后代(descendant)是不一样的。孩子标签就是父标签的直接下一代,而后代标签则包括了父标签下面所有的子子孙孙。通俗来说,descendant包括了child。
import requestsfrom bs4 import BeautifulSoup res = requests.get('https://www.pythonscraping.com/pages/page3.html') soup = BeautifulSoup(res.text, 'lxml') gifts = soup.find('table', id='giftList').childrenfor name in gifts:print(name)
a34de1251f0d9fe1e645927f19a896e8b4d429308760b6c2d20d6300079ed38e Item Title 01c3ce868d2b3d9bce8da5c1b7e41e5bb4d429308760b6c2d20d6300079ed38e Description 01c3ce868d2b3d9bce8da5c1b7e41e5bb4d429308760b6c2d20d6300079ed38e Cost 01c3ce868d2b3d9bce8da5c1b7e41e5bb4d429308760b6c2d20d6300079ed38e Image 01c3ce868d2b3d9bce8da5c1b7e41e5bfd273fcf5bcad3dfdad3c41bd81ad3e5 74fba5e8748c18e4c7aaf54d8810cc88b6c5a531a458a2e790c1fd6421739d1c Vegetable Basket b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c This vegetable basket is the perfect gift for your health conscious (or overweight) friends! ad2cf09ead3580775b079926a632eacdNow with super-colorful bell peppers!54bdf357c58b8a65c66d7c19c8e4d114 b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c $15.00 b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c 9a46db21579a768d08373e8ca7876194 b90dd5946f0946207856a8a37f441edffd273fcf5bcad3dfdad3c41bd81ad3e5 122764c9c24651b25e9ef14256824881b6c5a531a458a2e790c1fd6421739d1c Russian Nesting Dolls b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! ad2cf09ead3580775b079926a632eacd8 entire dolls per set! Octuple the presents!54bdf357c58b8a65c66d7c19c8e4d114 b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c $10,000.52 b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c a97819b98f35d216cfd6dd3cef5adb05 b90dd5946f0946207856a8a37f441edffd273fcf5bcad3dfdad3c41bd81ad3e5
找到表格后,选取当前结点为tr,并找到这个tr之后的兄弟节点,由于第一个tr为表格标题,这样的写法能提取出所有除开表格标题的正文数据。
import requestsfrom bs4 import BeautifulSoup res = requests.get('https://www.pythonscraping.com/pages/page3.html') soup = BeautifulSoup(res.text, 'lxml') gifts = soup.find('table', id='giftList').tr.next_siblingsfor name in gifts:print(name)
74fba5e8748c18e4c7aaf54d8810cc88b6c5a531a458a2e790c1fd6421739d1c Vegetable Basket b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c This vegetable basket is the perfect gift for your health conscious (or overweight) friends! ad2cf09ead3580775b079926a632eacdNow with super-colorful bell peppers!54bdf357c58b8a65c66d7c19c8e4d114 b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c $15.00 b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c 9a46db21579a768d08373e8ca7876194 b90dd5946f0946207856a8a37f441edffd273fcf5bcad3dfdad3c41bd81ad3e5 122764c9c24651b25e9ef14256824881b6c5a531a458a2e790c1fd6421739d1c Russian Nesting Dolls b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! ad2cf09ead3580775b079926a632eacd8 entire dolls per set! Octuple the presents!54bdf357c58b8a65c66d7c19c8e4d114 b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c $10,000.52 b90dd5946f0946207856a8a37f441edfb6c5a531a458a2e790c1fd6421739d1c a97819b98f35d216cfd6dd3cef5adb05 b90dd5946f0946207856a8a37f441edffd273fcf5bcad3dfdad3c41bd81ad3e5
查找商品的价格,可以根据商品的图片找到其父标签b6c5a531a458a2e790c1fd6421739d1c
,其上一个兄弟标签就是价格。
import requestsfrom bs4 import BeautifulSoup res = requests.get('https://www.pythonscraping.com/pages/page3.html') soup = BeautifulSoup(res.text, 'lxml') price = soup.find('img', src='../img/gifts/img1.jpg').parent.previous_sibling.stringprint(price)
$15.00
采集所有商品图片,为了避免其他图片乱入。使用正则表达式精确搜索。
import reimport requestsfrom bs4 import BeautifulSoup res = requests.get('https://www.pythonscraping.com/pages/page3.html') soup = BeautifulSoup(res.text, 'lxml') imgs= soup.find_all('img', src=re.compile(r'../img/gifts/img.*.jpg'))for img in imgs:print(img['src'])
../img/gifts/img1.jpg ../img/gifts/img2.jpg ../img/gifts/img3.jpg ../img/gifts/img4.jpg ../img/gifts/img6.jpg
find_all()
还可以传入函数,对这个函数有个要求:就是其返回值必须是布尔类型,若是True则保留,若是False则剔除。
import reimport requestsfrom bs4 import BeautifulSoup res = requests.get('https://www.pythonscraping.com/pages/page3.html') soup = BeautifulSoup(res.text, 'lxml')# lambda tag: tag.name=='img'tags = soup.find_all(lambda tag: tag.has_attr('src'))for tag in tags:print(tag)
aded21d1853ff876c7b54c407667fe5d 9a46db21579a768d08373e8ca7876194 a97819b98f35d216cfd6dd3cef5adb05 de27600560324843865abec3513f041c eaae933564f288096c0ab39186f76de4 0239643376e4305457074ec1fa25fbd2
tag是一个Element对象,has_attr
用来判断是否有该属性。tag.name则是获取标签名。在上面的网页中,下面的写法返回的结果一样。lambda tag: tag.has_attr('src')
或lambda tag: tag.name=='img'
以上是Python数据采集--Beautifulsoup的使用的详细内容。更多信息请关注PHP中文网其他相关文章!