How selenium+python crawls Jianshu website

零到壹度

Apr 16, 2018 am 09:52 AM

Tags: python, selenium, Jianshu

Page loading logic

Once you have eagerly picked up the basics of web crawling online, you will want a target to practice on. Jianshu, with its large number of articles, holds plenty of valuable information, so it is a natural choice. Try it, though, and you will find it is not as simple as you expected, because much of its data is delivered through JavaScript. Let me first walk through it the way a traditional crawler would:

Open the Jianshu home page. Nothing seems special.

[Screenshot: Jianshu home page]

Open Chrome's developer tools and you will find that each article's title and href sit in an a tag. Again, nothing looks unusual.

[Screenshot: a.png]

The next step is to find all the a tags on the page. But wait: if you look carefully, you will notice that when you scroll about halfway down, the page loads more articles. This happens three times, after which a Read more button appears at the bottom.

[Screenshot: scrolling with the mouse wheel]

Not only that, but the href of the Read more button at the bottom does not tell us how to load the rest of the page. The only way is to keep clicking the Read more button.

[Screenshot: load_more.png]

What? Scrolling to the middle of the page three times and then clicking a button over and over is not something plain HTTP requests can do. Doesn't this look more like JavaScript at work? That's right: Jianshu's article list is not served by regular page requests. We cannot simply move from one URL to the next; instead, actions on the page trigger the loading of more content.
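To make the limitation concrete, here is a minimal sketch of what a traditional crawler does once it has the page source: it parses the HTML it was given and collects the a tags. The class name title matches the one used later in this article; the sample markup and URLs are made up for illustration, and the parser uses only the standard library rather than any particular crawling framework.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags whose class list contains 'title'."""

    def __init__(self):
        super().__init__()
        self.links = []         # finished (text, href) pairs
        self._in_title = False  # True while inside a matching <a>
        self._href = None       # href of the <a> currently being read
        self._text = []         # text fragments of the current <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "title" in classes:
            self._in_title = True
            self._href = attr_map.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_title:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title:
            self.links.append(("".join(self._text).strip(), self._href))
            self._in_title = False


def extract_title_links(html):
    """Return every (text, href) pair found in the given HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for the first batch of articles.
SAMPLE_HTML = """
<ul>
  <li><a class="title" href="/p/aaa111">First article</a></li>
  <li><a class="title" href="/p/bbb222">Second article</a></li>
  <li><a class="avatar" href="/u/someone">not a title</a></li>
</ul>
"""

print(extract_title_links(SAMPLE_HTML))
```

A parser like this only ever sees the links present in the HTML it is handed. Articles that appear after scrolling or clicking are added by JavaScript later, so they never reach it.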

Selenium introduction

Selenium is a web automation testing tool with bindings for many languages. Here we use Python's selenium package as a crawler. When crawling Jianshu, it works by repeatedly injecting JS code into the page so that the page keeps loading, and finally extracting all the a tags. First, install the selenium package for Python:

pip3 install selenium

chromedriver

Selenium has to drive a real browser. Here I use Chrome through chromedriver, the open-source driver program that lets Selenium control Chrome. Chrome can also run in headless mode, visiting web pages without showing a browser window, which is its biggest draw for crawling.

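Working in Python

Before writing any code, put chromedriver in the same folder as your script; we refer to it by path, and this keeps things simple. Our first task is to make the load-more button appear, which takes three scrolls to the middle of the page. For convenience, I scroll to the bottom instead.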

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

browser = webdriver.Chrome(service=Service("./chromedriver"))  # Selenium 4 syntax
browser.get("https://www.jianshu.com/")
for i in range(3):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # execute_script injects JS code
    time.sleep(2)  # loading takes time; 2 seconds is reasonable
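The fixed count of three scrolls can also be replaced by a check on the page height. The following is only a sketch under stated assumptions: scroll_until_stable is a hypothetical helper, not part of Selenium, and it assumes that the page height stops growing once no more articles load automatically. It only needs an object with an execute_script method, such as the browser above.

```python
import time


def scroll_until_stable(driver, max_rounds=10, pause=2.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that made the page taller.
    """
    grew = 0
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            break  # nothing new was loaded, so stop scrolling
        grew += 1
        last_height = new_height
    return grew
```

With the browser created above, the call would be scroll_until_stable(browser). A 2-second pause may still be too short on a slow connection, so treat the default as a starting point.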

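Here is the result:

[Screenshot: the load-more button appears]

Next, keep clicking the button to load more of the page. Add the following to the same .py file: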

for j in range(10):  # simulate 10 clicks here
    try:
        browser.execute_script("var a = document.getElementsByClassName('load-more'); a[0].click();")
        time.sleep(2)
    except Exception:  # e.g. the button is no longer on the page
        pass

# Notes on the JS code above:
# var a = document.getElementsByClassName('load-more');  selects the elements with class load-more
# a[0].click();  a is a collection, so take index 0 and call click()
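As a variation on the loop above, the JS snippet can report whether it found a button, so the loop stops as soon as there is nothing left to click instead of relying on an exception. This is a sketch, not the author's method: click_load_more and CLICK_LOAD_MORE_JS are hypothetical names, and it relies on execute_script handing the script's return value back to Python.

```python
import time

# Same selector as above, but the script returns false when no button exists.
CLICK_LOAD_MORE_JS = (
    "var a = document.getElementsByClassName('load-more');"
    "if (a.length === 0) { return false; }"
    "a[0].click(); return true;"
)


def click_load_more(driver, max_clicks=10, pause=2.0):
    """Click the load-more button up to max_clicks times; return the clicks made."""
    clicks = 0
    for _ in range(max_clicks):
        if not driver.execute_script(CLICK_LOAD_MORE_JS):
            break  # no load-more button on the page any more
        clicks += 1
        time.sleep(pause)  # wait for the next batch of articles to load
    return clicks
```

Called as click_load_more(browser), it returns how many clicks actually happened, which is handy for logging.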

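I won't post a screenshot of this step. Once it works, the page simply keeps loading until the loop finishes. The rest is much simpler: find the a tags and read their text and href attribute. Here I write them straight into a txt file.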

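from selenium.webdriver.common.by import By

titles = browser.find_elements(By.CLASS_NAME, "title")  # Selenium 4 syntax
with open("article_jianshu.txt", "w", encoding="utf-8") as f:
    for t in titles:
        try:
            f.write(t.text + " " + t.get_attribute("href"))
            f.write("\n")
        except TypeError:  # href is None, so the string concatenation fails
            pass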
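The TypeError handling above exists because an element without an href yields None. A small sketch of an alternative is to filter such entries explicitly before writing. write_articles is a hypothetical helper, and the sample titles and URLs are invented for the demonstration.

```python
import io


def write_articles(pairs, out):
    """Write 'title href' lines to a file-like object; return how many were written."""
    written = 0
    for title, href in pairs:
        if not title or href is None:
            continue  # skip entries with no text or no link
        out.write(f"{title} {href}\n")
        written += 1
    return written


# Demonstration with made-up data and an in-memory buffer instead of a file.
buffer = io.StringIO()
count = write_articles(
    [("First article", "https://www.jianshu.com/p/aaa111"), ("No link", None)],
    buffer,
)
print(count)
print(buffer.getvalue(), end="")
```

With Selenium, the pairs could be built as [(t.text, t.get_attribute("href")) for t in titles] and passed in together with the open file f.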

Final result

[Screenshot: Jianshu articles]

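Headless mode

Watching the page load over and over gets annoying, so once testing succeeds we no longer want the browser window shown. That calls for headless mode: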

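from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
# pass the options when creating the browser shown earlier (Selenium 4 syntax)
browser = webdriver.Chrome(service=Service("./chromedriver"), options=options)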

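Summary

When normal HTTP requests cannot fetch what we want, we can use Selenium to drive a browser and grab the content instead. This has pros and cons, for example:

Advantages

You can crawl by brute force
Jianshu does not require cookies to view articles, so there is no need to go hunting for proxies; in the author's experience, we can keep crawling without getting banned
The home page presumably loads its content via Ajax, so we do not have to construct extra HTTP requests ourselves

Disadvantages

Crawling is too slow. Consider our program: each click waits 2 seconds, so 600 clicks take 1200 seconds, or 20 minutes...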
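The arithmetic behind that estimate can be written as a tiny helper for trying other click counts or wait times. estimate_crawl_seconds is a hypothetical name, and the estimate only counts the fixed waits, not the time the browser itself needs.

```python
def estimate_crawl_seconds(clicks, wait_per_click=2.0):
    """Lower bound on crawl time: one fixed wait after every click."""
    return clicks * wait_per_click


seconds = estimate_crawl_seconds(600)
print(seconds, "seconds =", seconds / 60, "minutes")
```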

Appendix

Here is the complete code:

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Selenium 4 syntax: driver path via Service, options via the options argument
browser = webdriver.Chrome(service=Service("./chromedriver"), options=options)

browser.get("https://www.jianshu.com/")
for i in range(3):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# print(browser)
for j in range(10):
    try:
        browser.execute_script("var a = document.getElementsByClassName('load-more'); a[0].click();")
        time.sleep(2)
    except Exception:  # e.g. the button is no longer on the page
        pass

titles = browser.find_elements(By.CLASS_NAME, "title")
with open("article_jianshu.txt", "w", encoding="utf-8") as f:
    for t in titles:
        try:
            f.write(t.text + " " + t.get_attribute("href"))
            f.write("\n")
        except TypeError:  # href is None
            pass

browser.quit()  # close the browser when done
