A brief introduction to the usage of Beautifulsoup and selenium
Jul 20, 2017, 09:42 AM

Simple use of Beautifulsoup and selenium

Review of requests library

I haven't used requests for a long time, and since I will be writing a simple crawler later, here is a short review.

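import requests

r = requests.get('https://api.github.com/user', auth=('user@example.com', 'password'))
print(r.status_code)  # status code, 200
print(r.json())       # response body parsed as JSON
print(r.text)         # response body as text
print(r.headers)      # response headers
print(r.encoding)     # encoding, usually utf-8

# When the content is large, read a fixed number of bytes or one line at a time
# so that memory is not exhausted.
# Read one line at a time; chunk_size=512 is the default
for chunk in r.iter_lines():
    print(chunk)

# Read one block at a time, 512 bytes each
for chunk in r.iter_content(chunk_size=512):
    print(chunk)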

Note that iter_lines and iter_content return bytes. To write them to a file, whether the content is text or an image, open the file in wb mode.
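As a small illustration of the point about wb mode, the sketch below writes an iterable of byte chunks to a file. The helper name save_chunks and the temporary file are made up for this example, and a plain list of bytes stands in for r.iter_content(...) so that it runs without network access.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path and return the byte count."""
    written = 0
    # 'wb' because iter_content / iter_lines yield bytes, not str
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += f.write(chunk)
    return written


# Stand-in for r.iter_content(chunk_size=512)
fake_chunks = [b'hello ', b'', b'world']
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
total = save_chunks(fake_chunks, demo_path)
print(total)
```

With a real response this would look like save_chunks(r.iter_content(chunk_size=8192), 'out.bin'), ideally on a request made with stream=True so that the body is not loaded into memory all at once.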

Using Beautifulsoup

Let's get to the point. I had heard of this well-known library for a long time. Writing crawlers with regular expressions was not much trouble, but the matches were sometimes inaccurate. Beautifulsoup extracts data from HTML tags accurately; it is somewhat slower, but simple and easy to use.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# One thing to note: the second argument names the parser and should always be given,
# otherwise a warning is issued. lxml is recommended.
soup = BeautifulSoup(html_doc, 'lxml')

Continuing from the code above, here are some simple operations. Accessing a tag name as an attribute returns the first element that matches; it is shorthand for the find method.

soup.a
soup.find('a')

The two statements above are equivalent.
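Attribute access is shorthand for find with the same tag name. A quick check on a tiny made-up snippet follows; it uses Python's built-in html.parser rather than lxml only so that the sketch runs without extra packages.

```python
from bs4 import BeautifulSoup

snippet = ('<p class="title"><b>Hi</b></p>'
           '<a id="link1" href="/x">X</a>'
           '<a id="link2" href="/y">Y</a>')
soup = BeautifulSoup(snippet, 'html.parser')

first_a = soup.a            # attribute access
same_a = soup.find('a')     # explicit find
no_table = soup.table       # a tag that is not in the document

print(first_a is same_a)
print(first_a['id'])
print(no_table)
```

When the tag does not exist, both forms return None, so chained access such as soup.table.tr raises AttributeError.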

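# soup.body is a Tag object: all the HTML inside the body tag
print(soup.body)
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>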
# Get all the text inside body, without the tags
print(soup.body.text)
# Equivalent to
soup.body.get_text()
# Another way: strings is a generator over all the text pieces
for string in soup.body.strings:
    print(string, end='')
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
# Get the text inside this tag
print(soup.title.string)
The Dormouse's story
# A Tag's get method returns the value of the named attribute;
# this gets the class attribute of the first p tag
print(soup.p.get('class'))
# Equivalent to
print(soup.p['class'])
['title']
# All attributes of the a tag, as a dictionary
print(soup.a.attrs)
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
# The name of the tag
print(soup.title.name)
title
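The two ways of reading an attribute differ when the attribute is missing. The sketch below shows this on a one-tag document invented for the purpose; html.parser is used only to keep it dependency-free.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x" class="sister big" id="link1">Elsie</a>',
                     'html.parser')
link = soup.a

href = link.get('href')            # value of an existing attribute
classes = link['class']            # class is multi-valued, so this is a list
missing = link.get('title')        # get returns None for a missing attribute
fallback = link.get('title', '')   # or a default of your choice

try:
    link['title']                  # indexing raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(href, classes, missing, repr(fallback), raised)
```

Using get is convenient in crawlers, where a missing attribute is usually something to skip rather than a fatal error.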

find_all

The most commonly used methods are undoubtedly find_all and find. find_all finds every element that matches and returns a list; find returns the first element of that list. find_all has a limit parameter that caps the length of the list (that is, the number of matches returned). With limit=1 it behaves like find, except that the result is still a list with one element, while find returns the element itself, or None if nothing matches.
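The difference in return types can be seen in a short sketch. The three-link snippet is made up, and html.parser stands in for lxml so the example needs no extra install.

```python
from bs4 import BeautifulSoup

html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, 'html.parser')

all_links = soup.find_all('a')            # every match, as a list
two_links = soup.find_all('a', limit=2)   # stop after two matches
one_link = soup.find_all('a', limit=1)    # still a list, of length 1
first = soup.find('a')                    # the Tag itself, not a list
missing = soup.find('div')                # no match: None
no_matches = soup.find_all('div')         # no match: empty list

print(len(all_links), len(two_links), len(one_link))
print(one_link[0] is first, missing, len(no_matches))
```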

find_all also has a shorthand.

soup.find_all('a', id='link1')
soup('a', id='link1')

The two forms above are equivalent; the second is the shorthand, in which the object is called like a function.

find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs)

name

name is the tag name to search for; for example, the code below finds all p tags. Besides strings, you can pass regular expressions, lists, functions, and True.

# Pass a string
soup.find_all('p')

# Pass a regular expression
import re
# Tag names that start with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Tag names that contain t
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# Pass a list to search for several tags at once
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

If you pass True, there is no restriction on the name and every tag is matched.
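The True and function forms are not shown above, so here is a minimal sketch of both. The snippet and the filter function has_class_but_no_id are invented for illustration; html.parser replaces lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

html = ('<html><body>'
        '<p class="title" id="t">A</p>'
        '<p class="story">B</p>'
        '<b>C</b>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')


def has_class_but_no_id(tag):
    """Filter function: receives a Tag, returns True to keep it."""
    return tag.has_attr('class') and not tag.has_attr('id')


# True matches every tag in document order
all_names = [tag.name for tag in soup.find_all(True)]
# A function is called once per tag
filtered = soup.find_all(has_class_but_no_id)

print(all_names)
print([tag.get_text() for tag in filtered])
```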

recursive

When you call find_all() on a tag, Beautiful Soup searches all descendants of that tag. If you only want to search its direct children, pass recursive=False.

# title is not a direct child of html, but all descendants are searched
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

# With recursive=False only direct children are searched
soup.html.find_all("title", recursive=False)
# []

# title is a direct child of head, so the parameter makes no difference here
a = soup.head.find_all("title", recursive=False)
# [<title>The Dormouse's story</title>]

keyword and attrs

Use keyword arguments to add one or more attribute conditions and narrow the search.

# All a tags whose id is link1
soup.find_all('a', id='link1')

class is a reserved keyword in Python, so to search by class you can use class_, pass the class as the second positional argument, or pass a dictionary through attrs.

soup.find_all('p', class_='story')
soup.find_all('p', 'story')
soup.find_all('p', attrs={"class": "story"})

The three forms above are equivalent. class_ accepts strings, regular expressions, functions, and True.
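A sketch of the other value types class_ accepts, on a made-up snippet (html.parser is used in place of lxml to keep it self-contained). Note that a tag with several classes matches if any one of them matches.

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="title">T</p>'
        '<p class="story">S1</p>'
        '<p class="story featured">S2</p>'
        '<p>N</p>')
soup = BeautifulSoup(html, 'html.parser')


def texts(tags):
    return [tag.get_text() for tag in tags]


by_string = soup.find_all('p', class_='story')             # S2 matches on one of its classes
by_regex = soup.find_all('p', class_=re.compile('^st'))    # class starting with "st"
by_func = soup.find_all('p', class_=lambda c: c is not None and len(c) == 8)
any_class = soup.find_all('p', class_=True)                # any p that has a class at all

print(texts(by_string), texts(by_regex), texts(by_func), texts(any_class))
```

The function form receives each class value as a string (or None when the attribute is absent), which is why the lambda checks for None first.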

text

Searches by text content. The string parameter gives the same result.

a = soup.find_all(text='Elsie')
# Or, in version 4.4 and later, use string
a = soup.find_all(string='Elsie')

The text parameter also accepts strings, regular expressions, True, and lists.
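A brief sketch of those value types, using the string spelling and a made-up snippet parsed with html.parser. Without a tag name the results are the matching text nodes; combined with a tag name, the results are tags whose text matches.

```python
import re
from bs4 import BeautifulSoup

html = '<p><a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

exact = soup.find_all(string='Elsie')                 # exact text
from_list = soup.find_all(string=['Lacie', 'Tillie']) # any of several texts
by_regex = soup.find_all(string=re.compile('ie$'))    # text ending in "ie"
tags = soup.find_all('a', string='Elsie')             # a tags whose text is Elsie

print([str(s) for s in exact])
print([str(s) for s in from_list])
print([str(s) for s in by_regex])
print([t.name for t in tags])
```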

CSS Selector

You can also use CSS selectors through the select method, which always returns a list.

Here are several common operations.

# All div tags
soup.select('div')
# All elements whose id is username
soup.select('#username')
# All elements whose class is story
soup.select('.story')
# All span elements inside a div, possibly with other elements in between
soup.select('div span')
# All span elements directly inside a div, with nothing in between
soup.select('div > span')
# All input tags that have an id attribute, whatever its value
soup.select('input[id]')
# All input tags whose id attribute is user
soup.select('input[id="user"]')
# Several selectors at once: elements whose id is link1 or link2
soup.select("#link1, #link2")
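To see what a few of these selectors return, here is a sketch on a small made-up fragment (parsed with html.parser so that no extra parser is needed). It also uses select_one, which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

html = '''
<div id="box"><span class="x">direct</span><p><span>nested</span></p></div>
<input id="user" name="u"><input name="no-id">
<a id="link1" class="sister">Elsie</a><a id="link2" class="sister">Lacie</a>
'''
soup = BeautifulSoup(html, 'html.parser')

descendants = soup.select('div span')    # any depth below a div
children = soup.select('div > span')     # direct children only
with_id = soup.select('input[id]')       # inputs that carry an id attribute
by_class = soup.select('.sister')        # class selector
by_ids = soup.select('#link1, #link2')   # either id
first = soup.select_one('a.sister')      # first match only
nothing = soup.select('#missing')        # no match: empty list

print(len(descendants), len(children), len(with_id), len(by_class), len(by_ids))
print(first.get_text(), len(nothing))
```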

A small crawler example

The basic usage of requests and beautifulsoup4 has been covered above. That is already enough to write some simple crawlers, so let's try one.

This example comes from "Automate the Boring Stuff with Python" by Al Sweigart.

This crawler downloads comics from the XKCD website in batches. You can specify how many pages to download.

import os
import requests
from bs4 import BeautifulSoup

# exist_ok=True: no error if the folder already exists
os.makedirs('xkcd', exist_ok=True)

url = 'https://xkcd.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/57.0.2987.98 Safari/537.36'}


def save_img(img_url, limit=1):
    r = requests.get(img_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    try:
        img = 'https:' + soup.find('div', id='comic').img.get('src')
    except AttributeError:
        print('Image Not Found')
    else:
        print('Downloading', img)
        response = requests.get(img, headers=headers)
        with open(os.path.join('xkcd', os.path.basename(img)), 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024*1024):
                f.write(chunk)

    # One fewer page to go after each download
    limit -= 1
    # Find the URL of the previous comic
    if limit > 0:
        try:
            prev = 'https://xkcd.com' + soup.find('a', rel='prev').get('href')
        except AttributeError:
            print('Link Not Exist')
        else:
            save_img(prev, limit)


if __name__ == '__main__':
    save_img(url, limit=20)
    print('Done!')
Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
Downloading 
...
Done!
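The extraction step inside save_img can be tried offline. The sketch below runs it on a hand-written fragment that only imitates the relevant part of a comic page (the fragment, the file name example.png and the helper extract_links are assumptions). It builds the URLs with urllib.parse.urljoin rather than the string concatenation used above, and parses with html.parser instead of lxml to stay self-contained.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page = '''
<div id="comic"><img src="//imgs.xkcd.com/comics/example.png" alt="Example"></div>
<a rel="prev" href="/1999/">Prev</a>
'''


def extract_links(html, base='https://xkcd.com/'):
    """Return (image URL, previous-page URL); either may be None if not found."""
    soup = BeautifulSoup(html, 'html.parser')
    comic = soup.find('div', id='comic')
    img_tag = comic.img if comic is not None else None
    prev_tag = soup.find('a', rel='prev')
    # urljoin resolves both protocol-relative (//host/...) and root-relative (/...) links
    img_url = urljoin(base, img_tag['src']) if img_tag is not None else None
    prev_url = urljoin(base, prev_tag['href']) if prev_tag is not None else None
    return img_url, prev_url


img_url, prev_url = extract_links(page)
empty = extract_links('<p>nothing here</p>')
print(img_url, prev_url, empty)
```

Returning None for missing pieces makes the "Image Not Found" and "Link Not Exist" cases explicit without relying on AttributeError.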

Multi-threaded download

Single-threaded downloading is a bit slow, so multiple threads can be used instead. When we fetched prev we saw that the page URLs are very regular: they have the form https://xkcd.com/ followed by a number, and only that final number differs. We can therefore simply iterate over them with range.

import os
import threading
import requests
from bs4 import BeautifulSoup

os.makedirs('xkcd', exist_ok=True)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/57.0.2987.98 Safari/537.36'}


def download_imgs(start, end):
    for url_num in range(start, end):
        img_url = 'https://xkcd.com/' + str(url_num)
        r = requests.get(img_url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        try:
            img = 'https:' + soup.find('div', id='comic').img.get('src')
        except AttributeError:
            print('Image Not Found')
        else:
            print('Downloading', img)
            response = requests.get(img, headers=headers)
            with open(os.path.join('xkcd', os.path.basename(img)), 'wb') as f:
                for chunk in response.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)


if __name__ == '__main__':
    # Download comics 1 to 30, 10 per thread
    threads = []
    for i in range(1, 30, 10):
        thread_obj = threading.Thread(target=download_imgs, args=(i, i + 10))
        threads.append(thread_obj)
        thread_obj.start()

    # Block until every thread has finished
    for thread in threads:
        thread.join()

    # Printed only after all threads are done
    print('Done!')

Let's take a look at the result.

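The same split-into-batches idea can also be written with the standard library's concurrent.futures, which handles starting and joining the threads. This is only a sketch of an alternative, not the approach used above: split_ranges and fake_download are made-up names, and fake_download merely lists page numbers in place of the real download_imgs(start, end) so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(start, stop, step):
    """Split range(start, stop) into consecutive (begin, end) pairs of at most step items."""
    return [(i, min(i + step, stop)) for i in range(start, stop, step)]


def fake_download(bounds):
    """Placeholder for download_imgs(start, end); returns the page numbers it would fetch."""
    begin, end = bounds
    return list(range(begin, end))


batches = split_ranges(1, 31, 10)
with ThreadPoolExecutor(max_workers=3) as pool:
    # map keeps the input order; leaving the with-block waits for all workers
    results = list(pool.map(fake_download, batches))

pages = [n for batch in results for n in batch]
print(batches)
print(len(pages))
```

Because downloading is I/O-bound, threads help here even though only one of them runs Python code at any moment.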

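A first look at selenium

selenium is used for automated testing. Before using it you need to download a browser driver; I only downloaded the ones for Firefox and Chrome, and a quick search online will find them. Next, copy the downloaded file into the browser's installation directory. For Firefox, for example, put the driver in C:\Program Files (x86)\Mozilla Firefox and add that path to the environment variables; likewise, put the Chrome driver in C:\Program Files (x86)\Google\Chrome\Application and add that path to the environment variables. Finally, restart your IDE and start using it.

Simulating a Baidu search

The following example opens Chrome, visits the Baidu home page, simulates typing The Zen of Python, and then clicks the "百度一下" (search) button; pressing Enter works as well. Keys contains the keys that cannot be written as strings, such as the arrow keys, Tab, Enter, Esc, F1~F12, and Backspace. The script then waits 3 seconds, navigates to the Zhihu home page, goes back to Baidu, and finally quits (closes) the browser.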

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.Chrome()
# Open the Baidu home page in Chrome
browser.get('https://www.baidu.com/')
# Find the input box
input_area = browser.find_element_by_id('kw')
# Type into it
input_area.send_keys('The Zen of Python')
# Find the "百度一下" (search) button
search = browser.find_element_by_id('su')
# Click it
search.click()
# Or press Enter instead
# input_area.send_keys('The Zen of Python', Keys.ENTER)
time.sleep(3)
browser.get('https://www.zhihu.com/')
time.sleep(2)
# Go back to the Baidu search
browser.back()
time.sleep(2)
# Quit the browser
browser.quit()


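send_keys simulates typing. An element's clear() method clears the input. Some other methods that simulate clicking the browser's buttons are as follows.

browser.back()     # Back button
browser.forward()  # Forward button
browser.refresh()  # Refresh button
browser.close()    # Close the current window
browser.quit()     # Quit the browser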

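Finding elements

The commonly used methods for finding an Element are listed below.

Method                                     Returned WebElement
find_element_by_id(id)                     Element whose id attribute matches
find_element_by_name(name)                 Element whose name attribute matches
find_element_by_class_name(name)           Element whose CSS class matches
find_element_by_tag_name(tag)              Element with the given tag name, such as div
find_element_by_css_selector(selector)     Element matching the CSS selector
find_element_by_xpath(xpath)               Element matching the XPath expression
find_element_by_link_text(text)            a tag whose text matches text exactly
find_element_by_partial_link_text(text)    a tag whose text contains text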

Logging in to CSDN

The following code simulates entering an account name and password and clicking the login button. The whole process is quite fast.

browser = webdriver.Chrome()
browser.get('https://passport.csdn.net/account/login')
browser.find_element_by_id('username').send_keys('user@example.com')
browser.find_element_by_id('password').send_keys('**********')
browser.find_element_by_class_name('logging').click()


The above is mostly a listing of APIs; some of it reflects my own understanding, and some is copied from the official documentation.


by @sunhaiyu

2017.7.13
