
Scrapy crawler practice: crawling Maoyan movie ranking data

Data crawling has become an important part of the big data era: crawler technology lets us automatically collect the data we need, then process and analyze it. Python has become one of the most popular programming languages for this work, and Scrapy, a powerful Python-based crawler framework, is widely used in the data-crawling field.

This article uses the Scrapy framework to crawl Maoyan movie ranking data. The process is divided into four parts: analyzing the page structure, writing the crawler framework, parsing the page, and storing the data.

1. Analyze the page structure

First, we need to analyze the structure of the Maoyan movie ranking page. For convenience, we use the Google Chrome browser to inspect the page and XPath to extract the required information.

The Maoyan movie ranking page lists multiple movies, and each movie is rendered as a similar HTML code block (a dd element inside the ranking list).

Our goal is to obtain five pieces of data from each HTML block: the movie's name, starring actors, release time, poster link, and rating. Press F12 in Chrome to open the developer tools, select the "Elements" tab, hover over the target element, then right-click and choose "Copy -> Copy XPath".

The copied XPath path is as follows:

/html/body/div[3]/div/div[2]/dl/dd[1]/div/div/div[1]/p[1]/a/text()

Here, "/html/body/div[3]/div/div[2]/dl/dd" locates the entries of the movie list; from there we descend to the specific elements we need to extract.
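
Absolute paths like the one above break easily when the page layout changes. A more robust approach, and the one used in the spider below, is to query by class attributes instead. A minimal sketch, assuming the class names on the live page match those used in this article:

# Relative, class-based XPath queries (class names taken from this article;
# verify them against the live page, since Maoyan may change its markup)
movies = response.xpath('//dl[@class="board-wrapper"]/dd')   # one dd per movie
title = movies[0].xpath('.//p[@class="name"]/a/@title').extract_first()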

2. Write a crawler framework

Next, we need to create a Scrapy project; for details, refer to Scrapy's official documentation (https://docs.scrapy.org/en/latest/intro/tutorial.html). After creating the project, create a new file named maoyan.py in the spiders directory.
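
If Scrapy is already installed, the project skeleton can be created from the command line:

scrapy startproject maoyan
cd maoyan
# Then create maoyan/spiders/maoyan.py by hand, as described above,
# and fill it with the spider code below.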

The following is our crawler framework code:

import scrapy
from maoyan.items import MaoyanItem

class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['http://maoyan.com/board/4']

    def parse(self, response):
        # Each movie on the ranking page is a dd element inside the board list
        movies = response.xpath('//dl[@class="board-wrapper"]/dd')
        for movie in movies:
            item = MaoyanItem()
            item['title'] = movie.xpath('.//p[@class="name"]/a/@title').extract_first()
            # strip() would fail on None, so fall back to an empty string
            item['actors'] = movie.xpath('.//p[@class="star"]/text()').extract_first(default='').strip()
            item['release_date'] = movie.xpath('.//p[@class="releasetime"]/text()').extract_first(default='').strip()
            item['image_url'] = movie.xpath('.//img/@data-src').extract_first()
            # The rating is split into an integer part and a fraction part, e.g. "9." + "6"
            item['score'] = (movie.xpath('.//p[@class="score"]/i[@class="integer"]/text()').extract_first(default='')
                             + movie.xpath('.//p[@class="score"]/i[@class="fraction"]/text()').extract_first(default=''))
            yield item

In the code, we first define the Spider's name, allowed_domains and start_urls. "allowed_domains" means that only URLs under this domain will be visited and extracted by the crawler, and "start_urls" is the first URL the crawler requests.

The Spider's parse method receives the response and uses XPath to extract the five data items for each movie, namely the name, starring actors, release time, poster link, and rating, saving them into a MaoyanItem.
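
Note that start_urls only covers the first page of the board, while the TOP100 ranking is split across several pages. A hedged sketch of following pagination at the end of parse, assuming the board pages via an offset query parameter and a "下一页" (next page) link, as it did when tutorials like this were written:

        # At the end of parse(): follow the "next page" link if one exists.
        # (Assumption: the board paginates with links like ?offset=10, ?offset=20, ...)
        next_page = response.xpath('//a[text()="下一页"]/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)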

Finally, we yield each Item object. Note: the MaoyanItem class is defined in the project's items.py file and must be imported, as shown at the top of the spider.
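
The article does not show items.py; a minimal sketch of that file, matching the five fields used in the spider:

import scrapy

class MaoyanItem(scrapy.Item):
    # One field per piece of data extracted in the spider
    title = scrapy.Field()
    actors = scrapy.Field()
    release_date = scrapy.Field()
    image_url = scrapy.Field()
    score = scrapy.Field()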

3. Parse the page

Once the crawler reaches the page we want, we can parse the HTML document and extract the information we need. This part mainly involves XPath queries and regular-expression processing on Scrapy response objects.

In this example, we use XPath paths to extract the five pieces of data for each movie on the Maoyan ranking page.
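
XPath expressions are easiest to debug interactively before putting them in the spider. A quick sketch using Scrapy's shell (note that Maoyan may block requests that lack a browser-like User-Agent, so the selectors may return nothing if the page is not served in full):

scrapy shell 'http://maoyan.com/board/4'

# Inside the shell, try the selectors before committing them to the spider:
>>> response.xpath('//dl[@class="board-wrapper"]/dd')
>>> response.xpath('//p[@class="name"]/a/@title').extract_first()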

4. Store data

After the data is parsed, it needs to be stored. Generally speaking, we either write it to a file or save it to a database.

In this example, we choose to save the data to a .csv file:

import csv

class MaoyanPipeline(object):

    def __init__(self):
        # newline='' prevents the csv module from writing blank lines on Windows
        self.file = open('maoyan_top100_movies.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # Write one row per movie in a fixed column order
        row = [item['title'], item['actors'], item['release_date'], item['image_url'], item['score']]
        self.writer.writerow(row)
        return item

    def close_spider(self, spider):
        self.file.close()

In the above code, we use Python's built-in csv module to write the data to a file named maoyan_top100_movies.csv. When the spider closes, the csv file is closed as well.
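
For the pipeline to run at all, it must be enabled in the project's settings.py. A minimal sketch (the number is the pipeline's execution priority; lower values run first):

# settings.py
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}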

Summary

Through this article, we learned how to use the Scrapy framework to crawl Maoyan movie ranking data: we analyzed the page structure, wrote the Scrapy spider, parsed the page, and stored the data. In practice, the goal is to balance legality, usability, and efficiency when collecting data.
