
Scrapy in action: crawling Douban movie data and rating popularity rankings

WBOY
2023-06-22 13:49:40

Scrapy is an open source Python framework for crawling data quickly and efficiently. In this article, we will use Scrapy to crawl the data and rating popularity of Douban movies.

  1. Preparation

First, we need to install Scrapy. You can install Scrapy by entering the following command at the command line:

pip install scrapy

Next, we will create a Scrapy project. At the command line, enter the following command:

scrapy startproject doubanmovie

This creates a Scrapy project named doubanmovie. Next, go into the project directory and generate a spider named douban, which Scrapy saves as douban.py in the project's spiders folder. At the command line, enter the following commands:

cd doubanmovie
scrapy genspider douban douban.com

Now, we have a Spider ready to use. Next, we will define the spider's behavior to get the required data.

  2. Crawling movie data

We will use Spider to crawl Douban movie data. Specifically, we will get the following information:

  • Movie Name
  • Director
  • Actor
  • Type
  • Country
  • Language
  • Release date
  • Length
  • Rating
  • Number of reviewers

Open the generated douban.py file and replace its contents with the following code:

import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each movie on the list page sits in its own <div class="item">.
        movie_list = response.xpath('//div[@class="item"]')
        for movie in movie_list:
            # The positional text() lookups assume one text node per field
            # inside the info <p>; a node that does not exist yields None.
            yield {
                'name': movie.xpath('.//span[@class="title"]/text()').get(),
                'director': movie.xpath('.//div[@class="bd"]/p/text()[1]').get(),
                'actors': movie.xpath('.//div[@class="bd"]/p/text()[2]').get(),
                'genre': movie.xpath('.//div[@class="bd"]/p/text()[3]').get(),
                'country': movie.xpath('.//div[@class="bd"]/p/text()[4]').get(),
                'language': movie.xpath('.//div[@class="bd"]/p/text()[5]').get(),
                'release_date': movie.xpath('.//div[@class="bd"]/p/text()[6]').get(),
                'duration': movie.xpath('.//div[@class="bd"]/p/text()[7]').get(),
                'rating': movie.xpath('.//span[@class="rating_num"]/text()').get(),
                # Assumes the review count is the last <span> in the star block;
                # selecting rating_num here would just repeat the rating.
                'num_reviews': movie.xpath('.//div[@class="star"]/span[last()]/text()').get(),
            }

In this code, we use XPath to select the fields we need. Each yield hands one movie, as a dict, back to Scrapy, which logs it and passes it on to any configured pipelines or feed exports.
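To see how a relative path picks fields out of one list item without touching the network, here is a minimal sketch. It uses Python's standard-library ElementTree as a stand-in for Scrapy's selectors, and the markup is an invented, heavily simplified fragment, not the real Douban page. ElementTree supports only a subset of XPath (no text() node test), so the sketch reads the .text attribute instead.

```python
import xml.etree.ElementTree as ET

# Invented, simplified fragment; the real page markup is more complex.
SAMPLE = """
<ol>
  <li><div class="item">
    <span class="title">Example Movie A</span>
    <span class="rating_num">9.1</span>
  </div></li>
  <li><div class="item">
    <span class="title">Example Movie B</span>
    <span class="rating_num">8.7</span>
  </div></li>
</ol>
"""


def extract_movies(markup):
    """Return one dict per <div class="item"> found in the markup."""
    root = ET.fromstring(markup.strip())
    movies = []
    for item in root.findall(".//div[@class='item']"):
        # The leading "." keeps each lookup inside the current item.
        title = item.find(".//span[@class='title']")
        rating = item.find(".//span[@class='rating_num']")
        movies.append({
            "name": title.text if title is not None else None,
            "rating": rating.text if rating is not None else None,
        })
    return movies


movies = extract_movies(SAMPLE)
```

The same pattern applies in the spider: select the repeating container first, then run relative queries against each container so that fields from different movies cannot get mixed up.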

If we run the spider now with the command scrapy crawl douban, it crawls only the first page of the Top 250 list, because the spider does not follow pagination yet, and prints the scraped items in the log output on the command line.
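Log output is rarely the end goal, so it can help to save the items and to crawl politely. The following is a sketch of per-spider settings under stated assumptions: the file name movies.json and the user-agent string are made up, the values are illustrative rather than tuned, and the FEEDS setting with its overwrite option assumes a reasonably recent Scrapy release (2.4 or later).

```python
# Illustrative values only; the file name and user-agent string are made up.
CUSTOM_SETTINGS = {
    # Identify the crawler; some sites reject Scrapy's default user agent.
    "USER_AGENT": "Mozilla/5.0 (compatible; doubanmovie-tutorial)",
    # Wait between requests to keep the load on the site low.
    "DOWNLOAD_DELAY": 1.0,
    # Write every yielded item to a JSON file, replacing any earlier run.
    "FEEDS": {
        "movies.json": {"format": "json", "encoding": "utf8", "overwrite": True},
    },
}

# In the spider, this dict would be attached as a class attribute:
#     class DoubanSpider(scrapy.Spider):
#         custom_settings = CUSTOM_SETTINGS
```

Keeping these in custom_settings scopes them to this one spider; the same keys could instead go in the project's settings.py if every spider should share them.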

  3. Get rating popularity ranking

The spider above reads only its start URL, which is the first page of the list. Next, we will crawl the whole Top 250 list and record each movie's position in it.

We do not need a new spider for this. We extend the existing one so that it follows the list's "next page" link until it runs out of pages.

In the douban.py file, replace the spider with the following code:

import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each movie on the list page sits in its own <div class="item">.
        movie_list = response.xpath('//div[@class="item"]')
        for movie in movie_list:
            yield {
                'name': movie.xpath('.//span[@class="title"]/text()').get(),
                'director': movie.xpath('.//div[@class="bd"]/p/text()[1]').get(),
                'actors': movie.xpath('.//div[@class="bd"]/p/text()[2]').get(),
                'genre': movie.xpath('.//div[@class="bd"]/p/text()[3]').get(),
                'country': movie.xpath('.//div[@class="bd"]/p/text()[4]').get(),
                'language': movie.xpath('.//div[@class="bd"]/p/text()[5]').get(),
                'release_date': movie.xpath('.//div[@class="bd"]/p/text()[6]').get(),
                'duration': movie.xpath('.//div[@class="bd"]/p/text()[7]').get(),
                'rating': movie.xpath('.//span[@class="rating_num"]/text()').get(),
                # Assumes the review count is the last <span> in the star block.
                'num_reviews': movie.xpath('.//div[@class="star"]/span[last()]/text()').get(),
            }

        # The "next" link is missing on the last page, so .get() returns None.
        next_page = response.xpath('//span[@class="next"]/a/@href').get()
        if next_page:
            # Resolve the relative href against the current page URL.
            url = response.urljoin(next_page)
            yield scrapy.Request(url, callback=self.parse)

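In this code, the next_page variable holds the link to the next page, if there is one. While such a link exists, the spider requests that page and handles it with the same parse method. On the last page there is no link, so no further request is made and the crawl ends.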
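The follow-until-no-link loop can be tried offline. The sketch below assumes a hypothetical three-page site on example.com whose "next" links are query-only relative hrefs; it resolves them with urllib.parse.urljoin from the standard library, which is the same kind of resolution response.urljoin performs against the page URL.

```python
from urllib.parse import urljoin

# Hypothetical site map: page URL -> relative "next" href (None on the last page).
FAKE_PAGES = {
    "https://example.com/top?start=0": "?start=25",
    "https://example.com/top?start=25": "?start=50",
    "https://example.com/top?start=50": None,
}


def follow_pages(start_url, pages):
    """Visit pages in order, following each 'next' href until none is left."""
    url = start_url
    visited = []
    while url is not None:
        visited.append(url)
        href = pages[url]
        # A query-only href keeps the base path and swaps the query string.
        url = urljoin(url, href) if href else None
    return visited


visited = follow_pages("https://example.com/top?start=0", FAKE_PAGES)
```

In the spider there is no explicit loop: each call to parse yields at most one new request, and Scrapy calls parse again when that response arrives.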

Next, we update the parse method to record each movie's rank. We use Python's enumerate function to number the movies. Because parse runs once per page, a plain enumerate would restart at 1 on every page, so we also pass a running offset to the next page's request to keep the numbering continuous.

In the douban.py file, replace the original parse method with this one:

    def parse(self, response, offset=0):
        # offset is the number of movies already seen on earlier pages.
        movie_list = response.xpath('//div[@class="item"]')
        for rank, movie in enumerate(movie_list, start=offset + 1):
            yield {
                'rank': rank,
                'name': movie.xpath('.//span[@class="title"]/text()').get(),
                'director': movie.xpath('.//div[@class="bd"]/p/text()[1]').get(),
                'actors': movie.xpath('.//div[@class="bd"]/p/text()[2]').get(),
                'genre': movie.xpath('.//div[@class="bd"]/p/text()[3]').get(),
                'country': movie.xpath('.//div[@class="bd"]/p/text()[4]').get(),
                'language': movie.xpath('.//div[@class="bd"]/p/text()[5]').get(),
                'release_date': movie.xpath('.//div[@class="bd"]/p/text()[6]').get(),
                'duration': movie.xpath('.//div[@class="bd"]/p/text()[7]').get(),
                'rating': movie.xpath('.//span[@class="rating_num"]/text()').get(),
                # Assumes the review count is the last <span> in the star block.
                'num_reviews': movie.xpath('.//div[@class="star"]/span[last()]/text()').get(),
            }

        next_page = response.xpath('//span[@class="next"]/a/@href').get()
        if next_page:
            url = response.urljoin(next_page)
            # Pass the running total on so ranks continue on the next page.
            yield scrapy.Request(
                url,
                callback=self.parse,
                cb_kwargs={'offset': offset + len(movie_list)},
            )
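Numbering with enumerate is per call, so ranks only stay continuous across pages if each page knows how many entries came before it. The sketch below shows that arithmetic on two invented pages of three titles each; the helper name rank_page and the titles are hypothetical.

```python
def rank_page(titles, offset=0):
    """Number one page of titles, continuing after `offset` earlier entries."""
    return [(rank, title) for rank, title in enumerate(titles, start=offset + 1)]


# Invented sample pages.
page_one = ["Movie A", "Movie B", "Movie C"]
page_two = ["Movie D", "Movie E", "Movie F"]

ranked = rank_page(page_one)
# Without the offset, page two would be numbered 1, 2, 3 again.
ranked += rank_page(page_two, offset=len(page_one))
```

An alternative, if the page itself prints the rank next to each movie, is to scrape that number directly instead of counting; that avoids depending on the order in which pages are processed.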

Now, if we run the spider again, it crawls all 250 movies and prints them in the log output on the command line. Each item now also carries the movie's rank.
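The scraped values are raw strings, so a small post-processing step is usually needed before comparing movies by rating or by number of reviewers. The following sketch assumes the review count arrives as digits followed by text (for example "1200人评价") and uses invented sample items; the helper names parse_count, clean_item, and most_reviewed are hypothetical.

```python
import re


def parse_count(text):
    """Pull the first run of digits out of a string; None if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def clean_item(item):
    """Strip whitespace and convert the numeric fields of one scraped item."""
    rating = (item.get("rating") or "").strip()
    return {
        "rank": item.get("rank"),
        "name": (item.get("name") or "").strip(),
        "rating": float(rating) if rating else None,
        "num_reviews": parse_count(item.get("num_reviews")),
    }


def most_reviewed(items, top_n=3):
    """Return the top_n cleaned items, ordered by review count, highest first."""
    cleaned = [clean_item(i) for i in items]
    return sorted(cleaned, key=lambda m: m["num_reviews"] or 0, reverse=True)[:top_n]


# Invented sample items shaped like the spider's output.
SAMPLE_ITEMS = [
    {"rank": 1, "name": " Movie A ", "rating": "9.5", "num_reviews": "1200人评价"},
    {"rank": 2, "name": "Movie B", "rating": "9.4", "num_reviews": "3400人评价"},
    {"rank": 3, "name": "Movie C", "rating": "9.3", "num_reviews": None},
]

top = most_reviewed(SAMPLE_ITEMS, top_n=2)
```

Sorting by review count gives a popularity view that can differ from the list's own rank; swapping the sort key to "rating" orders the same items by score instead.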

  4. Conclusion

Scrapy is a very powerful and flexible tool for crawling data quickly and efficiently. In this article, we have successfully used Scrapy to crawl the data and rating popularity of Douban movies.

We used Python code and XPath to select the information we wanted from each page, and the yield statement to hand each item back to Scrapy. Scrapy took care of scheduling the requests and following the pagination links, which leaves us free to focus on analyzing and processing the data.
