Scrapy and scrapy-splash framework quickly load js pages

1. Preface

Crawling static pages with a crawler program is usually straightforward, and earlier articles covered many such cases. But how do we crawl pages whose content is loaded dynamically by JS?

There are several ways to crawl dynamic JS pages:

  1. selenium + phantomjs.

    • phantomjs is a headless browser and selenium is an automated testing framework. The page is requested through the headless browser, the JS is allowed to finish loading, and selenium then extracts the data. Because headless browsers consume a lot of resources, this approach performs poorly.

  2. The scrapy-splash framework.

    • Splash is a JS rendering service: a lightweight browser engine built on Twisted and QT that exposes an HTTP API directly. Being fast and lightweight, it lends itself to distributed deployment.

    • Splash integrates with the Scrapy crawler framework. The two are compatible with each other and together give better crawling efficiency.

2. Setting up the Splash environment

The Splash service runs in a Docker container, so Docker needs to be installed first.

2.1 Docker installation (Windows 10 Home)

On Windows 10 Professional or other operating systems, installation is straightforward. On Windows 10 Home, Docker has to be installed through the Toolbox tool (the latest version is required).

For details on installing Docker, refer to the document: Install Docker on WIN10

2.2 Splash installation

    docker pull scrapinghub/splash

2.3 Start the Splash service

    docker run -p 8050:8050 scrapinghub/splash

At this point, open a browser and go to 192.168.99.100:8050 to see the Splash web interface.

Enter any URL in the input box and click Render me! to see what the page looks like after rendering.
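
The same rendering can also be requested without the web interface, through the HTTP API mentioned in the preface. The following is a minimal sketch using only the Python standard library and Splash's render.html endpoint. The helper names (build_render_url, fetch_rendered_html) and the example.com target are illustrative assumptions, and the base address is the one used in this article, so adjust it to wherever your container is listening.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is reachable at the address used in this article.
SPLASH_BASE = "http://192.168.99.100:8050"


def build_render_url(page_url, wait=1.0, base=SPLASH_BASE):
    """Build a URL for Splash's render.html endpoint (hypothetical helper)."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=1.0, timeout=30):
    """Ask Splash to render the page and return the resulting HTML as text."""
    with urlopen(build_render_url(page_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example call (needs a running Splash container, so it is left commented out):
# html = fetch_rendered_html("https://example.com")
```

The wait parameter plays the same role as the wait argument passed to SplashRequest later in this article: it gives the page's JS time to run before the HTML is returned.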

2.4 Install the Python scrapy-splash package

    pip install scrapy-splash

3. Testing a Scrapy crawler on a JS-loaded page, using Google News as an example

For business reasons we crawl some foreign news websites, such as Google News. It turned out that the page content is actually generated by JS code, so we used the scrapy-splash framework together with Splash's JS rendering service to obtain the data. See the following code for details:

3.1 settings.py configuration

    # URL of the Splash rendering service
    SPLASH_URL = 'http://192.168.99.100:8050'

    # Splash-aware duplicate filter
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    # Use Splash's HTTP cache storage
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

    # Spider middleware
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    # Downloader middlewares
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    # Default request headers
    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    }

    # Item pipelines
    ITEM_PIPELINES = {
        'news.pipelines.NewsPipeline': 300,
    }
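
The address 192.168.99.100 is the one Docker Toolbox typically assigns to its virtual machine; on other setups Splash may be listening on localhost or another host. One way to avoid editing settings.py per machine is to read the address from the environment. This is only a sketch: the environment variable name SPLASH_URL and the helper resolve_splash_url are assumptions made for illustration, not part of scrapy-splash.

```python
import os

# Fallback taken from this article; typical for a Docker Toolbox VM.
DEFAULT_SPLASH_URL = "http://192.168.99.100:8050"


def resolve_splash_url(environ=os.environ):
    """Return the Splash URL from the (hypothetical) SPLASH_URL variable, else the default."""
    url = environ.get("SPLASH_URL", DEFAULT_SPLASH_URL)
    # Drop a trailing slash so endpoint paths can be appended consistently.
    return url.rstrip("/")


# In settings.py this could stand in for the hard-coded value:
SPLASH_URL = resolve_splash_url()
```

With this in place, setting SPLASH_URL=http://localhost:8050 in the shell before running the crawler would point it at a local container without touching the project files.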

3.2 items field definition

    import scrapy


    class NewsItem(scrapy.Item):
        # Title
        title = scrapy.Field()
        # URL of the image
        image_url = scrapy.Field()
        # News source
        source = scrapy.Field()
        # URL opened on click
        action_url = scrapy.Field()

3.3 Spider code

In the project's spiders directory, create a new file, new_spider.py, with the following content:

    from scrapy import Spider
    from scrapy_splash import SplashRequest

    from news.items import NewsItem


    class GoogleNewsSpider(Spider):
        name = "google_news"

        start_urls = ["https://news.google.com/news/headlines?ned=cn&gl=CN&hl=zh-CN"]

        def start_requests(self):
            for url in self.start_urls:
                # Request through SplashRequest and wait 1 second for the JS to render
                yield SplashRequest(url, self.parse, args={'wait': 1})

        def parse(self, response):
            for element in response.xpath('//p[@class="qx0yFc"]'):
                actionUrl = element.xpath('.//a[@class="nuEeue hzdq5d ME7ew"]/@href').extract_first()
                title = element.xpath('.//a[@class="nuEeue hzdq5d ME7ew"]/text()').extract_first()
                source = element.xpath('.//span[@class="IH8C7b Pc0Wt"]/text()').extract_first()
                imageUrl = element.xpath('.//img[@class="lmFAjc"]/@src').extract_first()

                item = NewsItem()
                item['title'] = title
                item['image_url'] = imageUrl
                item['action_url'] = actionUrl
                item['source'] = source
                yield item
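
Note that extract_first() returns None when an XPath matches nothing, and href or src attributes may be relative. A small normalisation step before building the item can keep empty rows out of the database. The sketch below is a hypothetical helper, not part of the original spider: the name clean_fields, the image_url key, and the choice of https://news.google.com/ as the base for relative links are all assumptions.

```python
from urllib.parse import urljoin

# Assumption: relative links are resolved against the site that was crawled.
BASE_URL = "https://news.google.com/"


def clean_fields(title, image_url, action_url, source, base_url=BASE_URL):
    """Return a dict of tidied field values, or None when the title is missing."""
    if not title or not title.strip():
        return None
    return {
        "title": title.strip(),
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        "image_url": urljoin(base_url, image_url) if image_url else None,
        "action_url": urljoin(base_url, action_url) if action_url else None,
        "source": source.strip() if source else None,
    }
```

Inside parse, the four extracted values could be passed through such a helper, skipping the element when it returns None. Within a spider, response.urljoin offers similar resolution against the response URL.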

3.4 pipelines.py code

Store the item data in a MySQL database.

• Create the db_news database

    CREATE DATABASE db_news;

• Create the tb_google_news table

    CREATE TABLE tb_google_news(
        id INT AUTO_INCREMENT,
        title VARCHAR(50),
        image_url VARCHAR(200),
        action_url VARCHAR(200),
        source VARCHAR(30),
        PRIMARY KEY(id)
    )ENGINE=INNODB DEFAULT CHARSET=utf8;

• NewsPipeline class

    import pymysql


    class NewsPipeline(object):
        def __init__(self):
            # Connect to the local db_news database
            self.conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='root', db='db_news', charset='utf8')
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # Parameterized insert: values are passed separately from the SQL text
            sql = '''insert into tb_google_news (title,image_url,action_url,source) values(%s,%s,%s,%s)'''
            self.cursor.execute(sql, (item["title"], item["image_url"], item["action_url"], item["source"]))
            self.conn.commit()
            return item

        def close_spider(self, spider):
            # Scrapy passes the spider instance when the spider closes
            self.cursor.close()
            self.conn.close()
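
To try the storage logic without a MySQL server, the same open / insert / close pattern can be sketched with Python's built-in sqlite3 module. This is a stand-in, not the article's pipeline: the class name SqliteNewsStore and the image_url column name are assumptions, SQLite column types replace the MySQL ones, and SQLite uses ? placeholders where pymysql uses %s.

```python
import sqlite3


class SqliteNewsStore:
    """Stand-in for NewsPipeline that writes to SQLite instead of MySQL."""

    def __init__(self, path=":memory:"):
        # An in-memory database keeps the example self-contained.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tb_google_news ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "title TEXT, image_url TEXT, action_url TEXT, source TEXT)"
        )

    def process_item(self, item):
        # Values are bound as parameters, never formatted into the SQL string.
        self.conn.execute(
            "INSERT INTO tb_google_news (title, image_url, action_url, source) "
            "VALUES (?, ?, ?, ?)",
            (item["title"], item["image_url"], item["action_url"], item["source"]),
        )
        self.conn.commit()
        return item

    def count(self):
        # Number of rows stored so far.
        return self.conn.execute("SELECT COUNT(*) FROM tb_google_news").fetchone()[0]

    def close(self):
        self.conn.close()
```

Binding values as parameters, as both versions do, lets the database driver handle quoting, which matters here because news titles often contain quotes and other special characters.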

3.5 Running the Scrapy crawler

Run the following in the console:

    scrapy crawl google_news

Once the crawl finishes, the scraped news items appear in the database.
