Home  >  Article  >  Backend Development  >  PHP uses CURL to implement multi-threading to crawl web pages or download files

PHP uses CURL to implement multi-threading to crawl web pages or download files

墨辰丷
墨辰丷Original
2018-06-11 11:23:401968browse

PHP can use Curl to complete various file transfer operations, such as simulating a browser to send GET, POST requests, etc. However, because the PHP language itself does not support multi-threading, developing crawler programs is not efficient, but you can use Curl. With the help of Curl function, concurrent multi-threads can access multiple URL addresses to achieve concurrent multi-thread crawling of web pages or downloading files.

PHP can use Curl Functions to complete various file transfer operations, such as simulating a browser to send GET, POST requests, etc. are limited by the fact that the PHP language itself does not support multi-threading, so the efficiency of developing crawler programs is not high. At this time, you often need to use Curl Multi Functions, which can achieve concurrent multi-threaded access to multiple URL addresses. Since Curl Multi Function is so powerful, can Curl Multi Functions be used to write concurrent multi-threaded file downloads? Of course, my code is given below:

Code 1: Write the obtained code directly into a File

<?php 
$urls = array(  
 &#39;http://www.sina.com.cn/&#39;,  
 &#39;http://www.sohu.com/&#39;,  
 &#39;http://www.163.com/&#39; 
); // 设置要抓取的页面URL  
   
$save_to=&#39;/test.txt&#39;;  // 把抓取的代码写入该文件   
  
$st = fopen($save_to,"a");  
$mh = curl_multi_init();   
  
foreach ($urls as $i => $url) {  
 $conn[$i] = curl_init($url);  
 curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");  
 curl_setopt($conn[$i], CURLOPT_HEADER ,0);  
 curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT,60);  
 curl_setopt($conn[$i], CURLOPT_FILE,$st); // 设置将爬取的代码写入文件  
 curl_multi_add_handle ($mh,$conn[$i]);  
} // 初始化  
   
do {  
 curl_multi_exec($mh,$active);  
} while ($active); // 执行  
   
foreach ($urls as $i => $url) {  
 curl_multi_remove_handle($mh,$conn[$i]);  
 curl_close($conn[$i]);  
} // 结束清理  
   
curl_multi_close($mh);  
fclose($st); 
?>

Code 2: Put the obtained code into a variable first, and then write it to a file

<?php 
$urls = array(  
 &#39;http://www.sina.com.cn/&#39;,  
 &#39;http://www.sohu.com/&#39;,  
 &#39;http://www.163.com/&#39; 
);  
  
$save_to=&#39;/test.txt&#39;;  // 把抓取的代码写入该文件  
$st = fopen($save_to,"a");  
  
$mh = curl_multi_init();  
foreach ($urls as $i => $url) {  
 $conn[$i] = curl_init($url);  
 curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");  
 curl_setopt($conn[$i], CURLOPT_HEADER ,0);  
 curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT,60);  
 curl_setopt($conn[$i],CURLOPT_RETURNTRANSFER,true); // 设置不将爬取代码写到浏览器,而是转化为字符串  
 curl_multi_add_handle ($mh,$conn[$i]);  
}  
  
do {  
 curl_multi_exec($mh,$active);  
} while ($active);  
   
foreach ($urls as $i => $url) {  
 $data = curl_multi_getcontent($conn[$i]); // 获得爬取的代码字符串  
 fwrite($st,$data); // 将字符串写入文件。当然,也可以不写入文件,比如存入数据库  
} // 获得数据变量,并写入文件  
  
foreach ($urls as $i => $url) {  
 curl_multi_remove_handle($mh,$conn[$i]);  
 curl_close($conn[$i]);  
}  
  
curl_multi_close($mh);  
fclose($st);  
?>

Summary: The above is the entire content of this article, I hope it will be helpful to everyone's study.

Related recommendations:

PHP implements simple online reading of PDF files

Common PHP exception handling methods

Two methods of PHP array fusion

The above is the detailed content of PHP uses CURL to implement multi-threading to crawl web pages or download files. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn