Home  >  Article  >  Backend Development  >  How to grab BT Paradise movie data

How to grab BT Paradise movie data

WBOY
WBOYOriginal
2016-07-30 13:30:523045browse

I had a rest at night and wanted to watch two good movies.

I searched for a long time but couldn’t find what I wanted to watchHow to grab BT Paradise movie data.

I suddenly thought that someone had crawled Zhihu’s user data before. I had a whimHow to grab BT Paradise movie data,

It’s okay to crawl down the movie information of BT Paradise,I can check the database directly next time. How to grab BT Paradise movie dataHow to grab BT Paradise movie data

I can only say that I am so bored How to grab BT Paradise movie data, haha, I can still code ^_^


1. Grab the website html source code

<span style="font-size:24px;">$url = "www.bttiantang.cc";
$html = shell_exec("curl $url");</span>

2. Get the total number of pages, Total number of movies (regular matching)

<span style="font-size:24px;">preg_match("/<span class=\"pageinfo\">.*?<\/span>/", $html, $pageCount);
preg_match_all("/\d{1,10000}/",$pageCount[0],$pageCount);</span>

3. Capture movie information (regular matching information)

<span style="font-size:24px;">preg_match("/\d{4}\/\d{2}\/\d{2}/" , $pageInfo[0][$i], $updateTime);

preg_match("/<font color=\"#FF6600\">(.*?)<i>/" , $pageInfo[0][$i], $movieName);
        
preg_match("/<strong>(\d{1})<\/strong>/" , $pageInfo[0][$i], $movieScore_int);
     
preg_match("/<em class=\"fm\">(\d{1})<\/em>/" , $pageInfo[0][$i], $movieScore_decimal);
        
preg_match("/href=\"(.*?)\"/" , $pageInfo[0][$i], $movieUrl);
      
preg_match("/<p class=\"des\">(.*?)<\/p>/" , $pageInfo[0][$i], $actor);
       </span>

4. Insert into the database and you’re done

Generally speaking, the speed of php crawling is quite fast. It takes less than 4 minutes to collect more than 20,000 pieces of information.

start:01:22:54

end:01:26:11



Attached database screenshot:



Attached source code:

<?php

$url = "www.bttiantang.cc";
$html = shell_exec("curl $url");

preg_match("/<span class=\"pageinfo\">.*?<\/span>/", $html, $pageCount);
preg_match_all("/\d{1,10000}/",$pageCount[0],$pageCount);

$pageSize = intval($pageCount[0][0]);
$movieCount = $pageCount[0][1];

$conn = mysql_connect('***','***','');
mysql_select_db('***',$conn);
mysql_query('set names utf8',$conn);

for($j=1;$j<=$pageSize;$j++){
    $movieHtml = shell_exec("curl $url?PageNo=$j");
    preg_match_all("/<div class=\"item cl\">.*?<\/div>/s", $movieHtml, $pageInfo);
    for($i=0;$i<count($pageInfo[0]);$i++){
        preg_match("/\d{4}\/\d{2}\/\d{2}/" , $pageInfo[0][$i], $updateTime);
        /******clear ad*****/
            if(empty($updateTime))continue;
        /*******************/
        $updateTime = str_replace('/','-',$updateTime[0]);


        preg_match("/<font color=\"#FF6600\">(.*?)<i>/" , $pageInfo[0][$i], $movieName);
        /*****same conditions*****/
        if(empty($movieName))
            preg_match("/<b>(.*?)<i>/" , $pageInfo[0][$i], $movieName);
        if(empty($movieName))
            preg_match("/<b>(.*?)<\/b>/" , $pageInfo[0][$i], $movieName);
        /************************/
        $movieName = $movieName[1];

        preg_match("/<strong>(\d{1})<\/strong>/" , $pageInfo[0][$i], $movieScore_int);
        $movieScore_int = $movieScore_int[1];
        preg_match("/<em class=\"fm\">(\d{1})<\/em>/" , $pageInfo[0][$i], $movieScore_decimal);
        $movieScore_decimal = $movieScore_decimal[1];
        $movieScore = floatval($movieScore_int.'.'.$movieScore_decimal);

        preg_match("/href=\"(.*?)\"/" , $pageInfo[0][$i], $movieUrl);
        $movieUrl = $movieUrl[1];

        preg_match("/<p class=\"des\">(.*?)<\/p>/" , $pageInfo[0][$i], $actor);
        $movieActor = str_replace("<em>",'',str_replace("</em>",'',$actor[1]));

        mysql_unbuffered_query("insert into movie (name,actor,url,update_ts,score) values ('$movieName','$movieActor','$movieUrl',<span style="white-space:pre">	</span>'$updateTime','$movieScore')");
    }

}

?>

This movie information is grabbed from BT Paradise and does not involve confidential information. Therefore, I do not bear any legal responsibility!

If any relevant movie information involves your copyright or intellectual property rights or other interests, please inform us and it will be deleted as soon as possible after confirmation.

Copyright Statement: This article is an original article by the blogger and may not be reproduced without the blogger's permission.

The above introduces how to crawl BT Paradise movie data, including aspects of content. I hope it will be helpful to friends who are interested in PHP tutorials.

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn