
Sharing tips on capturing Douban movie data with PHP and phpSpider!

WBOY | Original | 2023-07-21 11:48:18


[Introduction]
With so much information online, finding the data you actually need takes some tooling. Douban Movies is a well-known movie information platform and a valuable resource for film enthusiasts. This article shows how to use PHP and the phpSpider library to scrape Douban movie data so that readers can quickly collect the information they need.

[Background]
The official Douban Movies API provides interfaces for searching movies and fetching movie details, but it restricts frequent access and large-scale data collection. As an alternative, we can scrape the site with phpSpider, a simple PHP crawler framework. phpSpider supports concurrent crawling, automatic deduplication, and web page parsing, which makes it well suited to small-scale scraping jobs.

[Code Implementation]
First, install the phpSpider library in your PHP environment with Composer:

composer require owner888/phpspider

The following sample code scrapes Douban movie data:

<?php
require 'vendor/autoload.php';
use phpspider\core\phpspider;
use phpspider\core\requests;

// URL of the page to crawl
$url = 'https://movie.douban.com/top250';

// Crawler configuration for phpSpider
$config = [
    'name' => 'douban_movie',
    'log_show' => false,
    'interval' => 1000,
    'user_agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
    'domains' => [
        'movie.douban.com'
    ],
    'scan_urls' => [
        $url
    ],
    'content_url_regexes' => [
        'https://movie.douban.com/subject/[0-9]+/'
    ],
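    // These are CSS selectors; phpSpider defaults to XPath, so selector_type is set explicitly
    'fields' => [
        [
            'name' => 'title',
            'selector' => '#content h1 span:first',
            'selector_type' => 'css',
            'required' => true
        ],
        [
            'name' => 'rating',
            'selector' => '.rating_num',
            'selector_type' => 'css',
            'required' => true
        ],
        [
            'name' => 'summary',
            'selector' => '#link-report span[property="v:summary"]',
            'selector_type' => 'css',
            'required' => true
        ],
    ]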
];

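// Handle each extracted record in the on_extract_page callback
function on_extract_page($page, $data){
    // Store the record in a database, or process it some other way
    $title = $data['title'];
    $rating = $data['rating'];
    $summary = $data['summary'];
    // This example assumes a MySQL table named movie; one connection is reused across calls
    static $db = null;
    if ($db === null) {
        $db = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'username', 'password');
        $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    }
    $stmt = $db->prepare('INSERT INTO movie(title, rating, summary) VALUES(?, ?, ?)');
    $stmt->execute([$title, $rating, $summary]);
    // Return the record so phpSpider can continue processing it
    return $data;
}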

// Set the input encoding, register the callback, and start the crawl
requests::$input_encoding = 'utf-8';
$spider = new phpspider($config);
$spider->on_extract_page = 'on_extract_page';
$spider->start();

The sample code specifies what to scrape through its configuration: the start URL, the data fields to extract, and a callback function. The on_extract_page callback processes each extracted record. Running the script scrapes the title, rating, and summary of the movies linked from the Douban Top 250 page and stores them in the database.
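To make the role of content_url_regexes more concrete, here is a minimal sketch in plain PHP that checks which URLs would count as movie detail pages under the pattern from the configuration. It does not use phpSpider. The helper name is_content_url and the choice of `#` as the regex delimiter are assumptions made for illustration, and phpSpider's internal matching may differ in detail.

```php
<?php
// Pattern copied from the 'content_url_regexes' entry in the sample configuration.
$pattern = 'https://movie.douban.com/subject/[0-9]+/';

// Hypothetical helper (not part of phpSpider): true when $url matches $pattern.
// '#' is used as the delimiter so the slashes in the URL need no escaping.
function is_content_url(string $url, string $pattern): bool
{
    return preg_match('#' . $pattern . '#', $url) === 1;
}

$candidates = [
    'https://movie.douban.com/subject/1292052/',          // a movie detail page
    'https://movie.douban.com/top250?start=25&filter=',   // a list (pagination) page
    'https://movie.douban.com/subject/1292052/comments',  // a sub-page of a detail page
];

foreach ($candidates as $candidate) {
    $label = is_content_url($candidate, $pattern) ? 'content page' : 'skipped';
    echo $candidate . ' => ' . $label . PHP_EOL;
}
```

Because the pattern is not anchored, the comments sub-page also matches. Appending `$` to the pattern is one way to restrict matches to the detail page itself.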

[Summary]
This article showed how to use PHP and the phpSpider library to scrape Douban movie data, with a complete code example. Readers can adapt the configuration in the example to collect the data they need. In practice, set the access frequency conservatively so that the crawler does not put excessive load on the target website. I hope this article helps readers obtain Douban movie data more easily.
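In the sample configuration, the 'interval' setting is what controls the access frequency. The plain PHP sketch below only illustrates the underlying idea of enforcing a minimum gap between requests; it is not phpSpider's implementation. The helper name remaining_wait_ms is hypothetical, and the sketch assumes the interval is expressed in milliseconds.

```php
<?php
// Hypothetical helper: milliseconds still to wait before the next request may be sent.
// $lastRequestAt and $now are Unix timestamps in seconds, as returned by microtime(true).
function remaining_wait_ms(float $lastRequestAt, float $now, int $minIntervalMs): int
{
    $elapsedMs = (int) round(($now - $lastRequestAt) * 1000);
    return max(0, $minIntervalMs - $elapsedMs);
}

// Usage: wait out the rest of the interval before sending the next request.
$minIntervalMs = 1000;            // mirrors 'interval' => 1000 (assumed to be milliseconds)
$lastRequestAt = microtime(true); // time at which the previous request was sent
// ... the previous request's response would be processed here ...
$waitMs = remaining_wait_ms($lastRequestAt, microtime(true), $minIntervalMs);
usleep($waitMs * 1000);           // usleep() takes microseconds
```

Raising the minimum interval lowers the load on the target site at the cost of a slower crawl.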

