Simple use of BeautifulSoup and selenium

Review of requests library

I haven't used requests for a long time, and since I will be writing a simple crawler later, here is a quick review.

import requests

r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
print(r.status_code)  # status code, 200 on success
print(r.json())       # body parsed as JSON
print(r.text)         # body as text
print(r.headers)      # response headers
print(r.encoding)     # encoding, usually utf-8

# When the content to write is large, avoid exhausting memory by
# handling a fixed number of bytes, or one line, at a time.
# Read one line at a time; chunk_size=512 is the default
for chunk in r.iter_lines():
    print(chunk)

# Read one block at a time, 512 bytes each
for chunk in r.iter_content(chunk_size=512):
    print(chunk)

Note that iter_lines and iter_content return bytes. To write them to a file, whether the content is text or an image, open the file in wb mode.
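As a minimal sketch of that point, the helper below writes an iterable of byte chunks to a file opened in wb mode. The name save_chunks and the sample data are hypothetical; the fake chunks stand in for what r.iter_content() would yield, so the example runs without network access.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    # 'wb' is required because the chunks are bytes, not str
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += f.write(chunk)
    return written


# Stand-in for r.iter_content(chunk_size=512)
fake_chunks = [b'\x89PNG', b'', b'demo-bytes']
target = os.path.join(tempfile.mkdtemp(), 'demo.bin')
size = save_chunks(fake_chunks, target)
print(size, os.path.getsize(target))
```

With requests, the call would look like save_chunks(r.iter_content(chunk_size=512), 'logo.png'), where 'logo.png' is a made-up file name and r comes from requests.get(url, stream=True) so that the body is not loaded into memory all at once.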

Using BeautifulSoup

Now to the main topic. I had heard about this well-known library for a long time. Writing crawlers with regular expressions was never much trouble, but the matches were sometimes inaccurate. BeautifulSoup extracts data from HTML tags precisely. It is a bit slow, but simple and easy to use.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# One thing to note: the second argument selects the parser.
# Always pass it, otherwise you get a warning. lxml is recommended.
soup = BeautifulSoup(html_doc, 'lxml')

Continuing from the code above, here are some simple operations. Accessing a tag name as an attribute (dot notation) returns the first element that matches. It is shorthand for the find method.

soup.body
soup.find('body')

The two statements above are equivalent.
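A small sketch of that equivalence, using a made-up two-paragraph document and the html.parser backend (chosen here only so that no extra parser needs to be installed):

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="title">first</p><p class="story">second</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p          # attribute access
same_p = soup.find('p')   # explicit find

print(first_p is same_p)     # both names refer to the same Tag object
print(first_p.get_text())
print(soup.table)            # a tag that is not in the document gives None
```

Because a missing tag yields None, chained access such as soup.table.tr raises AttributeError, which is the error the crawler examples later in this post catch.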

# soup.body is a Tag object: all the HTML inside the body tag
print(soup.body)
# <body>
# <p class="title"><b>The Dormouse's story</b></p>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# </body>

# Get all the text inside body, without tags
print(soup.body.text)
# Equivalent to
soup.body.get_text()
# Or like this: strings is a generator over every piece of text
for string in soup.body.strings:
    print(string, end='')
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...

# Get the text inside this tag
print(soup.title.string)
# The Dormouse's story

# A Tag's get method returns an attribute's value by name;
# this gets the class attribute of the first p tag
print(soup.p.get('class'))
# Equivalent to
print(soup.p['class'])
# ['title']

# All attributes of the a tag, as a dictionary
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

# The name of the tag
soup.title.name


The most commonly used methods are undoubtedly find_all and find. find_all returns a list of every element that matches; find returns only the first match. find_all takes a limit parameter that caps the length of the list (that is, the number of matches returned). With limit=1 it behaves like find, except that it still returns a list.
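The relationship between the two methods can be checked with a short sketch. The three-link document below is made up for the example, and html.parser is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

all_links = soup.find_all('a')
two_links = soup.find_all('a', limit=2)   # stop after two matches
one_link = soup.find_all('a', limit=1)    # still a list, with one element
first_link = soup.find('a')               # the element itself

print(len(all_links), len(two_links))
print(one_link[0] is first_link)

# When nothing matches, find_all gives an empty list and find gives None
no_images = soup.find_all('img')
no_image = soup.find('img')
print(len(no_images), no_image)
```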

find_all also has a shorthand.

soup.find_all('a', id='link1')
soup('a', id='link1')

The two calls above are equivalent. The second is the shorthand: calling the soup object directly.

find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs)


name is the tag you want to search for. For example, the code below finds all p tags. Besides a string, name also accepts a regular expression, a list, a function, or True.

# Pass a string
soup.find_all('p')

# Pass a regular expression
import re

# Tag names that start with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Tag names that contain t
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# Pass a list to search for several tags at once
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

If you pass True, there is no restriction on the name and every tag is matched.
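The True and function forms are easiest to see on a small document. The sketch below uses a shortened, made-up version of the sample HTML and the html.parser backend. The filter function has_class_but_no_id is an illustrative name, not something defined earlier in this post.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>The Dormouse story</title></head>'
        '<body><p class="title"><b>The Dormouse story</b></p>'
        '<p class="story"><a class="sister" id="link1">Elsie</a></p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# True matches every tag, but not the text strings between them
all_names = [tag.name for tag in soup.find_all(True)]
print(all_names)


def has_class_but_no_id(tag):
    """Filter function: keep tags that define class but not id."""
    return tag.has_attr('class') and not tag.has_attr('id')


# A function receives each tag and returns True for the ones to keep
filtered = [tag.name for tag in soup.find_all(has_class_but_no_id)]
print(filtered)
```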


When you call find_all() on a tag, Beautiful Soup searches all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

# title is not a direct child of html, but all descendants are searched
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

# With recursive=False, only direct children are searched
soup.html.find_all("title", recursive=False)
# []

# title is a direct child of head, so the parameter has no effect here
a = soup.head.find_all("title", recursive=False)
# [<title>The Dormouse's story</title>]

keyword and attrs

Keyword arguments add one or more attribute conditions to narrow the search.

# All a tags whose id is link1
soup.find_all('a', id='link1')

Searching by class needs care, because class is a reserved word in Python and cannot be used as a keyword argument. You can use class_ instead, pass the class as the second positional argument, or pass a dictionary through attrs.

soup.find_all('p', class_='story')
soup.find_all('p', 'story')
soup.find_all('p', attrs={"class": "story"})

The three calls above are equivalent. class_ accepts strings, regular expressions, functions, and True.
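A hedged sketch of the non-string forms of class_, on a made-up fragment parsed with html.parser. Note the a tag with two classes: each class name is tested on its own.

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="title">T</p><p class="story">S1</p>'
        '<a class="sister big" id="link1">Elsie</a>')
soup = BeautifulSoup(html, 'html.parser')

# Regular expression: class names starting with "s"
starts_with_s = soup.find_all(class_=re.compile('^s'))
# Function: class names exactly five characters long
five_chars = soup.find_all(class_=lambda c: c is not None and len(c) == 5)
# True: any tag that has a class attribute at all
any_class = soup.find_all(class_=True)

print([t.get_text() for t in starts_with_s])
print([t.get_text() for t in five_chars])
print(len(any_class))
```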


To search by text content, use the text parameter. The string parameter gives the same result.

a = soup.find_all(text='Elsie')
# Or, in version 4.4 and later, use string
a = soup.find_all(string='Elsie')

The text parameter can also accept strings, regular expressions, True, and lists.
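A short sketch of the list and regular-expression forms, written with the string parameter and assuming Beautiful Soup 4.4 or later. The one-paragraph document is made up for the example.

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="story">Once upon a time there were three little sisters: '
        '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')

# List: match any of several exact strings
names = soup.find_all(string=['Elsie', 'Tillie'])
# Regular expression: match strings that contain "sisters"
about_sisters = soup.find_all(string=re.compile('sisters'))
# Combined with a tag name, string filters tags by their text
lacie_tags = soup.find_all('a', string='Lacie')

print([str(s) for s in names])
print(len(about_sisters))
print(lacie_tags[0]['id'])
```

Without a tag name the result is a list of strings rather than tags, which is why the first two results are printed differently from the third.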

CSS Selector

You can also use CSS selectors through the select method, which always returns a list.

Here are several common selectors.

# All div tags
soup.select('div')
# All elements whose id is username
soup.select('#username')
# All elements whose class is story
soup.select('.story')
# All span elements inside a div; other elements may sit in between
soup.select('div span')
# All span elements directly inside a div, with nothing in between
soup.select('div > span')
# All input tags that have an id attribute, whatever its value
soup.select('input[id]')
# All input tags whose id attribute equals user
soup.select('input[id="user"]')
# Several selectors at once: elements whose id is link1 or link2 both match
soup.select("#link1, #link2")
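To see a few of these selectors return real results, here is a sketch run against the three sister links from the sample document. It assumes a Beautiful Soup version whose select support is installed (recent releases pull in the soupsieve package for this).

```python
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
        '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
        '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
        '</p>')
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.select('#link1, #link2')         # id selectors; the comma means "or"
by_class = soup.select('p.story > a.sister')  # class selectors with a direct-child combinator
by_attr = soup.select('a[href$="tillie"]')    # href value ending in "tillie"
first = soup.select_one('a.sister')           # select_one returns one tag or None

print([a['id'] for a in by_id])
print(len(by_class))
print(by_attr[0].get_text())
print(first['id'])
```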

A small crawler example

The sections above cover the basic usage of requests and beautifulsoup4, which is already enough to write some simple crawlers. Let's try one.

This example comes from "Automate the Boring Stuff with Python" by Al Sweigart (US).

The crawler downloads images from the xkcd comics site in batches. You can specify how many pages to download.

import os

import requests
from bs4 import BeautifulSoup

# exist_ok=True: no error if the folder already exists
os.makedirs('xkcd', exist_ok=True)

url = 'https://xkcd.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/57.0.2987.98 Safari/537.36'}


def save_img(img_url, limit=1):
    r = requests.get(img_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    try:
        img = 'https:' + soup.find('div', id='comic').img.get('src')
    except AttributeError:
        print('Image Not Found')
    else:
        print('Downloading', img)
        response = requests.get(img, headers=headers)
        with open(os.path.join('xkcd', os.path.basename(img)), 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024*1024):
                f.write(chunk)
    # Decrement after each image
    limit -= 1
    # Find the URL of the previous comic
    if limit > 0:
        try:
            prev = 'https://xkcd.com' + soup.find('a', rel='prev').get('href')
        except AttributeError:
            print('Link Not Exist')
        else:
            save_img(prev, limit)


if __name__ == '__main__':
    save_img(url, limit=20)
    print('Done!')
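The parsing half of the crawler can be tried without touching the network. The sketch below is not the book's code: parse_comic_page and SAMPLE_PAGE are hypothetical names, and the sample HTML is a trimmed-down guess at the page shape the crawler expects (a div with id comic containing an img, and an a tag with rel="prev").

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down page in the shape the crawler expects
SAMPLE_PAGE = '''
<div id="comic"><img src="//imgs.xkcd.com/comics/sample.png"/></div>
<a rel="prev" href="/1/">Prev</a>
'''


def parse_comic_page(html):
    """Return (image_url, prev_url); either one is None when it is not found."""
    soup = BeautifulSoup(html, 'html.parser')

    comic = soup.find('div', id='comic')
    img_tag = comic.img if comic is not None else None
    src = img_tag.get('src') if img_tag is not None else None
    image_url = 'https:' + src if src else None

    prev_tag = soup.find('a', rel='prev')
    href = prev_tag.get('href') if prev_tag is not None else None
    prev_url = 'https://xkcd.com' + href if href else None
    return image_url, prev_url


found = parse_comic_page(SAMPLE_PAGE)
missing = parse_comic_page('<p>no comic here</p>')
print(found)
print(missing)
```

Checking for None explicitly, instead of catching AttributeError, is a design choice for this sketch. It keeps the parsing step separate from the downloading step so each can be tested on its own.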

Multi-threaded download

The single-threaded version is a bit slow, so we can use multiple threads. While collecting the prev links we saw that the page URLs are very regular: each has the form https://xkcd.com/ followed by a number, and only that number differs. That makes it easy to iterate over them with range.

import os
import threading

import requests
from bs4 import BeautifulSoup

# Make sure the output folder exists
os.makedirs('xkcd', exist_ok=True)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/57.0.2987.98 Safari/537.36'}


def download_imgs(start, end):
    for url_num in range(start, end):
        img_url = 'https://xkcd.com/' + str(url_num)
        r = requests.get(img_url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        try:
            img = 'https:' + soup.find('div', id='comic').img.get('src')
        except AttributeError:
            print('Image Not Found')
        else:
            print('Downloading', img)
            response = requests.get(img, headers=headers)
            with open(os.path.join('xkcd', os.path.basename(img)), 'wb') as f:
                for chunk in response.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)


if __name__ == '__main__':
    # Download comics 1 to 30, 10 per thread
    threads = []
    for i in range(1, 30, 10):
        thread_obj = threading.Thread(target=download_imgs, args=(i, i + 10))
        # Keep a reference so the thread can be joined below
        threads.append(thread_obj)
        thread_obj.start()
    # Block until every thread has finished
    for thread in threads:
        thread.join()
    # Printed only after all threads are done
    print('Done!')
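As an alternative to managing Thread objects by hand, the standard library's concurrent.futures.ThreadPoolExecutor can spread the page numbers over a pool of workers. This is a swapped-in technique rather than the approach used above, and fake_download is a hypothetical stand-in that only builds the page URL, so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url_num):
    """Stand-in for downloading one comic; returns the page URL it would fetch."""
    return 'https://xkcd.com/' + str(url_num)


def download_all(start, end, workers=3):
    """Run fake_download for every number in [start, end) on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map returns results in input order even though calls run concurrently
        return list(pool.map(fake_download, range(start, end)))


urls = download_all(1, 31)
print(len(urls), urls[0], urls[-1])
```

Leaving the with block waits for all submitted work to finish, which plays the role of the join loop in the threading version.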


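Using selenium

selenium is used for automated testing. Before using it you need to download a browser driver; I only downloaded the ones for Firefox and Chrome. A quick web search will turn them up. Next, copy the downloaded file into the browser's installation directory. For Firefox, put the driver in C:\Program Files (x86)\Mozilla Firefox and add that path to the environment variables. Likewise, put the Chrome driver in C:\Program Files (x86)\Google\Chrome\Application and add that path to the environment variables. Finally, restart your IDE and start using it.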


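The following example opens the Chrome browser, visits the Baidu home page, types The Zen of Python into the search box, and then clicks the "百度一下" (Baidu search) button; pressing Enter works as well. Keys contains the keys that cannot be written as strings, such as the arrow keys, Tab, Enter, Esc, F1~F12, and Backspace. The script then waits 3 seconds, jumps to the Zhihu home page, goes back to Baidu, and finally quits (closes) the browser.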

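from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Chrome()
# Open the Baidu home page in Chrome
browser.get('https://www.baidu.com/')
# Find the input box (find_element_by_id is the Selenium 3 API;
# Selenium 4 uses find_element(By.ID, 'kw') instead)
input_area = browser.find_element_by_id('kw')
# Type into it
input_area.send_keys('The Zen of Python')
# Find the "百度一下" (search) button
search = browser.find_element_by_id('su')
# Click it
search.click()
# Or press Enter instead
# input_area.send_keys('The Zen of Python', Keys.ENTER)
time.sleep(3)
# Jump to the Zhihu home page, as described above
browser.get('https://www.zhihu.com/')
time.sleep(2)
# Go back to the Baidu search results
browser.back()
time.sleep(2)
# Quit the browser
browser.quit()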

browser.back()     # back button
browser.forward()  # forward button
browser.refresh()  # refresh button
browser.close()    # close the current window
browser.quit()     # quit the browser



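Method name                                WebElement returned
find_element_by_id(id)                     element whose id attribute matches
find_element_by_name(name)                 element whose name attribute matches
find_element_by_class_name(name)           element whose CSS class matches
find_element_by_tag_name(tag)              element whose tag name matches, such as div
find_element_by_css_selector(selector)     element matching the CSS selector
find_element_by_xpath(xpath)               element matching the xpath
find_element_by_link_text(text)            a tag whose text exactly matches the given text
find_element_by_partial_link_text(text)    a tag whose text contains the given text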
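Selenium 4 dropped these find_element_by_* helpers in favor of a single find_element(by, value) call, usually written with the By constants, for example browser.find_element(By.ID, 'kw'). The By constants are plain strings, so the sketch below maps each old helper to its strategy string. LEGACY_TO_STRATEGY, find_legacy, and RecordingBrowser are hypothetical names made up for this illustration, and the fake browser only records the call so the example runs without a real driver.

```python
# Old helper name -> locator strategy string (the value of the matching By constant)
LEGACY_TO_STRATEGY = {
    'find_element_by_id': 'id',                                # By.ID
    'find_element_by_name': 'name',                            # By.NAME
    'find_element_by_class_name': 'class name',                # By.CLASS_NAME
    'find_element_by_tag_name': 'tag name',                    # By.TAG_NAME
    'find_element_by_css_selector': 'css selector',            # By.CSS_SELECTOR
    'find_element_by_xpath': 'xpath',                          # By.XPATH
    'find_element_by_link_text': 'link text',                  # By.LINK_TEXT
    'find_element_by_partial_link_text': 'partial link text',  # By.PARTIAL_LINK_TEXT
}


def find_legacy(browser, legacy_name, value):
    """Call browser.find_element with the strategy for an old helper name."""
    return browser.find_element(LEGACY_TO_STRATEGY[legacy_name], value)


class RecordingBrowser:
    """Tiny fake driver for demonstration; it only echoes the call it receives."""

    def find_element(self, by, value):
        return (by, value)


result = find_legacy(RecordingBrowser(), 'find_element_by_id', 'kw')
print(result)
```

With a real driver, find_legacy(browser, 'find_element_by_id', 'kw') would amount to browser.find_element(By.ID, 'kw'). In new code, calling find_element with the By constants directly is the more readable choice.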



browser = webdriver.Chrome()

by @sunhaiyu


