search
HomeBackend DevelopmentPHP TutorialHow to implement crawler in PHP

How to implement crawler in PHP

Mar 10, 2018 am 11:16 AM
phpaccomplishreptile

Use PHP's curl extension to crawl page data. PHP's curl extension is a library supported by PHP that allows you to connect and communicate with various servers using various types of protocols.

This program captures Zhihu user data. To be able to access the user's personal page, the user needs to be logged in before accessing. When we click a user avatar link on the browser page to enter the user's personal center page, the reason why we can see the user's information is because when we click the link, the browser helps you bring the local cookies and submit them together. Go to a new page, so you can enter the user's personal center page. Therefore, before accessing the personal page, you need to obtain the user's cookie information, and then bring the cookie information with each curl request. In terms of obtaining cookie information, I used my own cookie. You can see your own cookie information on the page:

Copy one by one in the form of "__utma=?;__utmb=?;" Form a cookie string. This cookie string can then be used to send requests.

Initial example:

    $url = 'http://www.zhihu.com/people/mora-hu/about'; 
    //此处mora-hu代表用户ID
    $ch = curl_init($url); 
    //初始化会话
    curl_setopt($ch, CURLOPT_HEADER, 0);    
    curl_setopt($ch, CURLOPT_COOKIE, $this->config_arr['user_cookie']);  
    //设置请求COOKIE
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);    
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
     //将curl_exec()获取的信息以文件流的形式返回,而不是直接输出。
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);     
     $result = curl_exec($ch);    
    return $result;  //抓取的结果

Run the above code to get the personal center page of the mora-hu user. Using this result and then using regular expressions to process the page, you can obtain the name, gender and other information that needs to be captured.

Picture hotlink prevention

When outputting personal information after regularizing the returned results, it was found that the user's avatar cannot be opened when outputting it on the page. After reviewing the information, I found out that it was because Zhihu had protected the pictures from hotlinking. The solution is to forge a referer in the request header when requesting an image.

After using the regular expression to obtain the link to the image, send another request. At this time, bring the source of the image request, indicating that the request is forwarded from the Zhihu website. Specific examples are as follows:

function getImg($url, $u_id){    
    if (file_exists('./images/' . $u_id . ".jpg"))    
    {       
       return "images/$u_id" . '.jpg';    }    if (empty($url))    
    {        
       return ''; 
    }    $context_options = array(         
 'http' =>  
        array(
            'header' => "Referer:http://www.zhihu.com"//带上referer参数 
      )
  );          $context = stream_context_create($context_options);      $img = file_get_contents('http:' . $url, FALSE, $context);    file_put_contents('./images/' . $u_id . ".jpg", $img);    return "images/$u_id" . '.jpg';}

Crawling more users

The URLs of different users are almost the same, the difference lies in the user name. Use regular matching to get the user name list, spell the URLs one by one, and then send requests one by one (of course, one by one is slower, there is a solution below, which will be discussed later). After entering the new user's page, repeat the above steps and continue in this loop until you reach the amount of data you want.

Linux statistics file number

After the script has been running for a while, you need to see how many pictures have been obtained. When the amount of data is relatively large, it is a bit slow to open the folder to view the number of pictures. The script is run in the Linux environment, so you can use the Linux command to count the number of files:

Among them, ls -l is a long list output of file information in the directory (the files here can be directories, links , device files, etc.); grep "^-" filters long list output information, "^-" only retains general files, if only the directory is retained, "^d"; wc -l is the number of lines of statistical output information. The following is a running example:

PHP爬虫 数据抓取 数据分析 爬虫抓取数据

Handling of duplicate data when inserting into MySQL

After the program ran for a period of time, it was found that many users' data was duplicated , so it needs to be processed when inserting duplicate user data. The solution is as follows:

1) Check whether the data already exists in the database before inserting into the database;

2) Add a unique index and use INSERT INTO... ON DUPliCATE KEY UPDATE when inserting. ..

3) Add a unique index, use INSERT INGNO

<br/>

RE INTO...

4) Add a unique index, use REPLACE when inserting INTO...

Use curl_multi to implement I/O multiplexing to capture the page

At the beginning, a single process and a single curl were used to capture data. The speed was very slow. I hung up and crawled. I could only capture 2W of data at night, so I thought about whether I could request multiple users at once when entering a new user page and making a curl request. Later I discovered the good thing curl_multi. Functions such as curl_multi can request multiple URLs at the same time instead of requesting them one by one. This is an I/O multiplexing mechanism. The following is an example of using curl_multi crawler:

        $mh = curl_multi_init(); //返回一个新cURL批处理句柄
        for ($i = 0; $i < $max_size; $i++)
        {            $ch = curl_init();  //初始化单个cURL会话
            curl_setopt($ch, CURLOPT_HEADER, 0);
            curl_setopt($ch, CURLOPT_URL, &#39;http://www.zhihu.com/people/&#39; . $user_list[$i] . &#39;/about&#39;);
            curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
            curl_setopt($ch, CURLOPT_USERAGENT, &#39;Mozilla/5.0 (Windows NT 6.1; WOW64)
            AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36&#39;);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
             curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);            $requestMap[$i] = $ch;
            curl_multi_add_handle($mh, $ch); 
 //向curl批处理会话中添加单独的curl句柄
        }        $user_arr = array();        do {                        //运行当前 cURL 句柄的子连接
            while (($cme = curl_multi_exec($mh, $active)) == CURLM_CALL_MULTI_PERFORM);                        if ($cme != CURLM_OK) {break;}                        //获取当前解析的cURL的相关传输信息
            while ($done = curl_multi_info_read($mh))
            {                $info = curl_getinfo($done[&#39;handle&#39;]);                $tmp_result = curl_multi_getcontent($done[&#39;handle&#39;]);                $error = curl_error($done[&#39;handle&#39;]);                $user_arr[] = array_values(getUserInfo($tmp_result));                //保证同时有$max_size个请求在处理
                if ($i < sizeof($user_list) && isset($user_list[$i]) && $i < count($user_list))
                {                    $ch = curl_init();
                    curl_setopt($ch, CURLOPT_HEADER, 0); 
                   curl_setopt($ch, CURLOPT_URL, &#39;http://www.zhihu.com/people/&#39; . $user_list[$i] . &#39;/about&#39;); 
                   curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
                    curl_setopt($ch, CURLOPT_USERAGENT, &#39;Mozilla/5.0 (Windows NT 6.1; WOW64) 
                   AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36&#39;);
                    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
                     curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);                    $requestMap[$i] = $ch; 
                   curl_multi_add_handle($mh, $ch);                    $i++;
                }
                curl_multi_remove_handle($mh, $done[&#39;handle&#39;]);
            }            if ($active) 
               curl_multi_select($mh, 10);
        } while ($active);
        curl_multi_close($mh);        return $user_arr;

HTTP 429 Too Many Requests

Using the curl_multi function can send multiple requests at the same time, but during the execution process, When 200 requests were made, it was found that many requests could not be returned, that is, packet loss was discovered. After further analysis, use the curl_getinfo function to print each request handle information. This function returns an associative array containing HTTP response information. One of the fields is http_code, which represents the HTTP status code returned by the request. I saw that the http_code of many requests was 429. This return code means that too many requests were sent. I guessed that Zhihu had implemented anti-crawler protection, so I tested it on other websites and found that there was no problem when sending 200 requests at once, which proved my guess. Zhihu had implemented protection in this regard. That is, the number of one-time requests is limited. So I kept reducing the number of requests and found that there was no packet loss at 5. It shows that in this program, you can only send up to 5 requests at a time. Although it is not many, it is a small improvement.

使用Redis保存已经访问过的用户

抓取用户的过程中,发现有些用户是已经访问过的,而且他的关注者和关注了的用户都已经获取过了,虽然在数据库的层面做了重复数据的处理,但是程序还是会使用curl发请求,这样重复的发送请求就有很多重复的网络开销。还有一个就是待抓取的用户需要暂时保存在一个地方以便下一次执行,刚开始是放到数组里面,后来发现要在程序里添加多进程,在多进程编程里,子进程会共享程序代码、函数库,但是进程使用的变量与其他进程所使用的截然不同。不同进程之间的变量是分离的,不能被其他进程读取,所以是不能使用数组的。因此就想到了使用Redis缓存来保存已经处理好的用户以及待抓取的用户。这样每次执行完的时候都把用户push到一个already_request_queue队列中,把待抓取的用户(即每个用户的关注者和关注了的用户列表)push到request_queue里面,然后每次执行前都从request_queue里pop一个用户,然后判断是否在already_request_queue里面,如果在,则进行下一个,否则就继续执行。

在PHP中使用redis示例:

<?php    $redis = new Redis();    $redis->connect(&#39;127.0.0.1&#39;, &#39;6379&#39;);    $redis->set(&#39;tmp&#39;, &#39;value&#39;);    if ($redis->exists(&#39;tmp&#39;))
    {        echo $redis->get(&#39;tmp&#39;) . "\n";
    }

使用PHP的pcntl扩展实现多进程

改用了curl_multi函数实现多线程抓取用户信息之后,程序运行了一个晚上,最终得到的数据有10W。还不能达到自己的理想目标,于是便继续优化,后来发现php里面有一个pcntl扩展可以实现多进程编程。下面是多编程编程的示例:

    //PHP多进程demo    //fork10个进程
    for ($i = 0; $i < 10; $i++) {        $pid = pcntl_fork();        if ($pid == -1) {            echo "Could not fork!\n";            exit(1); 
       }        if (!$pid) {            echo "child process $i running\n";            //子进程执行完毕之后就退出,以免继续fork出新的子进程
            exit($i);
        }
    }        //等待子进程执行完毕,避免出现僵尸进程
    while (pcntl_waitpid(0, $status) != -1) {        $status = pcntl_wexitstatus($status); 
       echo "Child $status completed\n";
    }

在linux下查看系统的cpu信息

实现了多进程编程之后,就想着多开几条进程不断地抓取用户的数据,后来开了8调进程跑了一个晚上后发现只能拿到20W的数据,没有多大的提升。于是查阅资料发现,根据系统优化的CPU性能调优,程序的最大进程数不能随便给的,要根据CPU的核数和来给,最大进程数最好是cpu核数的2倍。因此需要查看cpu的信息来看看cpu的核数。在linux下查看cpu的信息的命令:

PHP爬虫 数据抓取 数据分析 爬虫抓取数据

其中,model name表示cpu类型信息,cpu cores表示cpu核数。这里的核数是1,因为是在虚拟机下运行,分配到的cpu核数比较少,因此只能开2条进程。最终的结果是,用了一个周末就抓取了110万的用户数据。

多进程编程中Redis和MySQL连接问题

在多进程条件下,程序运行了一段时间后,发现数据不能插入到数据库,会报mysql too many connections的错误,redis也是如此。

下面这段代码会执行失败:

<?php     for ($i = 0; $i < 10; $i++) {          $pid = pcntl_fork();          if ($pid == -1) {               echo "Could not fork!\n";               exit(1);
          }          if (!$pid) {               $redis = PRedis::getInstance();               // do something     
               exit;
          }
     }

根本原因是在各个子进程创建时,就已经继承了父进程一份完全一样的拷贝。对象可以拷贝,但是已创建的连接不能被拷贝成多个,由此产生的结果,就是各个进程都使用同一个redis连接,各干各的事,最终产生莫名其妙的冲突。

解决方法:

程序不能完全保证在fork进程之前,父进程不会创建redis连接实例。因此,要解决这个问题只能靠子进程本身了。试想一下,如果在子进程中获取的实例只与当前进程相关,那么这个问题就不存在了。于是解决方案就是稍微改造一下redis类实例化的静态方式,与当前进程ID绑定起来。

改造后的代码如下:

<?php     public static function getInstance() {          static $instances = array();          $key = getmypid();//获取当前进程ID
          if ($empty($instances[$key])) {               $inctances[$key] = new self();
          }               return $instances[$key];
     }

PHP统计脚本执行时间

因为想知道每个进程花费的时间是多少,因此写个函数统计脚本执行时间:

function microtime_float()
{     list($u_sec, $sec) = explode(&#39; &#39;, microtime()); 
     return (floatval($u_sec) + floatval($sec));
}$start_time = microtime_float();

 //do somethingusleep(100);$end_time = microtime_float();$total_time = $end_time - $start_time;$time_cost = sprintf("%.10f", $total_time);echo "program cost total " . $time_cost . "s\n";

若文中有不正确的地方,望各位指出以便改正。

相关推荐:

nodejs爬虫superagent和cheerio体验案例

NodeJS爬虫详解

Node.js爬虫之网页请求模块详解

The above is the detailed content of How to implement crawler in PHP. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
PHP: An Introduction to the Server-Side Scripting LanguagePHP: An Introduction to the Server-Side Scripting LanguageApr 16, 2025 am 12:18 AM

PHP is a server-side scripting language used for dynamic web development and server-side applications. 1.PHP is an interpreted language that does not require compilation and is suitable for rapid development. 2. PHP code is embedded in HTML, making it easy to develop web pages. 3. PHP processes server-side logic, generates HTML output, and supports user interaction and data processing. 4. PHP can interact with the database, process form submission, and execute server-side tasks.

PHP and the Web: Exploring its Long-Term ImpactPHP and the Web: Exploring its Long-Term ImpactApr 16, 2025 am 12:17 AM

PHP has shaped the network over the past few decades and will continue to play an important role in web development. 1) PHP originated in 1994 and has become the first choice for developers due to its ease of use and seamless integration with MySQL. 2) Its core functions include generating dynamic content and integrating with the database, allowing the website to be updated in real time and displayed in personalized manner. 3) The wide application and ecosystem of PHP have driven its long-term impact, but it also faces version updates and security challenges. 4) Performance improvements in recent years, such as the release of PHP7, enable it to compete with modern languages. 5) In the future, PHP needs to deal with new challenges such as containerization and microservices, but its flexibility and active community make it adaptable.

Why Use PHP? Advantages and Benefits ExplainedWhy Use PHP? Advantages and Benefits ExplainedApr 16, 2025 am 12:16 AM

The core benefits of PHP include ease of learning, strong web development support, rich libraries and frameworks, high performance and scalability, cross-platform compatibility, and cost-effectiveness. 1) Easy to learn and use, suitable for beginners; 2) Good integration with web servers and supports multiple databases; 3) Have powerful frameworks such as Laravel; 4) High performance can be achieved through optimization; 5) Support multiple operating systems; 6) Open source to reduce development costs.

Debunking the Myths: Is PHP Really a Dead Language?Debunking the Myths: Is PHP Really a Dead Language?Apr 16, 2025 am 12:15 AM

PHP is not dead. 1) The PHP community actively solves performance and security issues, and PHP7.x improves performance. 2) PHP is suitable for modern web development and is widely used in large websites. 3) PHP is easy to learn and the server performs well, but the type system is not as strict as static languages. 4) PHP is still important in the fields of content management and e-commerce, and the ecosystem continues to evolve. 5) Optimize performance through OPcache and APC, and use OOP and design patterns to improve code quality.

The PHP vs. Python Debate: Which is Better?The PHP vs. Python Debate: Which is Better?Apr 16, 2025 am 12:03 AM

PHP and Python have their own advantages and disadvantages, and the choice depends on the project requirements. 1) PHP is suitable for web development, easy to learn, rich community resources, but the syntax is not modern enough, and performance and security need to be paid attention to. 2) Python is suitable for data science and machine learning, with concise syntax and easy to learn, but there are bottlenecks in execution speed and memory management.

PHP's Purpose: Building Dynamic WebsitesPHP's Purpose: Building Dynamic WebsitesApr 15, 2025 am 12:18 AM

PHP is used to build dynamic websites, and its core functions include: 1. Generate dynamic content and generate web pages in real time by connecting with the database; 2. Process user interaction and form submissions, verify inputs and respond to operations; 3. Manage sessions and user authentication to provide a personalized experience; 4. Optimize performance and follow best practices to improve website efficiency and security.

PHP: Handling Databases and Server-Side LogicPHP: Handling Databases and Server-Side LogicApr 15, 2025 am 12:15 AM

PHP uses MySQLi and PDO extensions to interact in database operations and server-side logic processing, and processes server-side logic through functions such as session management. 1) Use MySQLi or PDO to connect to the database and execute SQL queries. 2) Handle HTTP requests and user status through session management and other functions. 3) Use transactions to ensure the atomicity of database operations. 4) Prevent SQL injection, use exception handling and closing connections for debugging. 5) Optimize performance through indexing and cache, write highly readable code and perform error handling.

How do you prevent SQL Injection in PHP? (Prepared statements, PDO)How do you prevent SQL Injection in PHP? (Prepared statements, PDO)Apr 15, 2025 am 12:15 AM

Using preprocessing statements and PDO in PHP can effectively prevent SQL injection attacks. 1) Use PDO to connect to the database and set the error mode. 2) Create preprocessing statements through the prepare method and pass data using placeholders and execute methods. 3) Process query results and ensure the security and performance of the code.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Chat Commands and How to Use Them
4 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

DVWA

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

MantisBT

MantisBT

Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.