Home  >  Article  >  Backend Development  >  PHP multi-threaded web page crawling example code

PHP multi-threaded web page crawling example code

怪我咯
怪我咯Original
2017-07-11 14:47:321153browse

Multi-threading (English: multithreading) refers to the technology that realizes the concurrent execution of multiple threads from software or hardware. Computers with multi-threading capabilities have hardware support that allows them to execute more than one thread at the same time, thereby improving overall processing performance. Systems with this capability include symmetric multiprocessors, multicore processors, and chip-level multithreading or simultaneous multithreading processors. [1] In a program, these independently running program fragments are called "Threads", and the concept of using them to program is called "Multithreading". Computers with multi-threading capabilities can execute more than one thread (translated as "thread" in Taiwan) at the same time due to hardware support, thus improving overall processing performance.

PHP can use Curl Functions to complete various file transfer operations, such as simulating a browser to send GET, POST request, etc.

Limited by the fact that the PHP language itself does not support multi-threading, the efficiency of developing crawler programs is not high. At this time, it is often necessary to use Curl Multi Functions, which can achieve concurrent multi-threaded access to multiple URL addresses. Since Curl Multi Function is so powerful, can Curl Multi Functions be used to write concurrent multi-threaded file downloads? Of course, my code is given below:
Code 1: Write the obtained code directly into a file

The code is as follows:

<?php 
$urls = array( 
&#39;http://www.php.cn/&#39;, 
&#39;http://www.baidu.com/&#39;, 
&#39;http://www.163.com/&#39; 
); // 设置要抓取的页面URL 

$save_to=&#39;/test.txt&#39;; // 把抓取的代码写入该文件 

$st = fopen($save_to,"a"); 
$mh = curl_multi_init(); 

foreach ($urls as $i => $url) { 
$conn[$i] = curl_init($url); 
curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"); 
curl_setopt($conn[$i], CURLOPT_HEADER ,0); 
curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT,60); 
curl_setopt($conn[$i], CURLOPT_FILE,$st); // 设置将爬取的代码写入文件 
curl_multi_add_handle ($mh,$conn[$i]); 
} // 初始化 

do { 
curl_multi_exec($mh,$active); 
} while ($active); // 执行 

foreach ($urls as $i => $url) { 
curl_multi_remove_handle($mh,$conn[$i]); 
curl_close($conn[$i]); 
} // 结束清理 

curl_multi_close($mh); 
fclose($st); 
?>

Code 2: Put the obtained code into the variable, and then write it to a file

The code is as follows:

<?php 
$urls = array( 
&#39;http://www.php.cn/&#39;, 
&#39;http://www.baidu.com/&#39;, 
&#39;http://www.163.com/&#39; 
); 

$save_to=&#39;/test.txt&#39;; // 把抓取的代码写入该文件 
$st = fopen($save_to,"a"); 

$mh = curl_multi_init(); 
foreach ($urls as $i => $url) { 
$conn[$i] = curl_init($url); 
curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"); 
curl_setopt($conn[$i], CURLOPT_HEADER ,0); 
curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT,60); 
curl_setopt($conn[$i],CURLOPT_RETURNTRANSFER,true); // 设置不将爬取代码写到浏览器,而是转化为字符串 
curl_multi_add_handle ($mh,$conn[$i]); 
} 

do { 
curl_multi_exec($mh,$active); 
} while ($active); 

foreach ($urls as $i => $url) { 
$data = curl_multi_getcontent($conn[$i]); // 获得爬取的代码字符串 
fwrite($st,$data); // 将字符串写入文件。当然,也可以不写入文件,比如存入数据库 
} // 获得数据变量,并写入文件 

foreach ($urls as $i => $url) { 
curl_multi_remove_handle($mh,$conn[$i]); 
curl_close($conn[$i]); 
} 

curl_multi_close($mh); 
fclose($st); 
?>

The above is the detailed content of PHP multi-threaded web page crawling example code. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn