
PHP Multi-Threaded Web Page Crawling Implementation Code

WBOY (Original) · 2016-07-21 15:35:58

Because the PHP language itself does not support multi-threading, crawler programs written in it tend to be inefficient. The usual workaround is the curl_multi family of functions, which can fetch multiple URL addresses concurrently. Since the curl_multi functions are so capable, can they also be used to write a concurrent, multi-connection file downloader? Of course; my code is given below:

Code 1: Write the fetched content directly to a file

The code is as follows:

<?php
$urls = array(
    'http://www.sina.com.cn/',
    'http://www.sohu.com/',
    'http://www.163.com/'
); // URLs of the pages to crawl

$save_to = '/test.txt'; // File the fetched content is appended to
$st = fopen($save_to, "a");

$mh = curl_multi_init();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
    curl_setopt($conn[$i], CURLOPT_HEADER, 0);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    curl_setopt($conn[$i], CURLOPT_FILE, $st); // Write the fetched content directly to the file
    curl_multi_add_handle($mh, $conn[$i]);
} // Initialization

do {
    curl_multi_exec($mh, $active);
} while ($active); // Run all transfers until complete

foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
} // Cleanup

curl_multi_close($mh);
fclose($st);
?>
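One caveat: the do/while loop above spins at full CPU while the transfers run, because curl_multi_exec returns immediately. A gentler variant (a sketch only, reusing the same $mh multi handle as above) blocks on curl_multi_select between iterations so the process sleeps until a socket has activity:

```php
<?php
// Drop-in replacement for the busy-wait loop: $mh is the multi handle
// built above, with all easy handles already added.
$active = null;
do {
    $status = curl_multi_exec($mh, $active);
} while ($status === CURLM_CALL_MULTI_PERFORM);

while ($active && $status === CURLM_OK) {
    // Block for up to 1 second waiting for socket activity
    if (curl_multi_select($mh, 1.0) === -1) {
        usleep(100000); // select failed; back off briefly before retrying
    }
    do {
        $status = curl_multi_exec($mh, $active);
    } while ($status === CURLM_CALL_MULTI_PERFORM);
}
?>
```

On recent PHP versions curl_multi_exec no longer returns CURLM_CALL_MULTI_PERFORM, so the inner do/while runs once and the pattern still works unchanged.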

Code 2: Store the fetched content in a variable first, then write it to a file
The code is as follows:

<?php
$urls = array(
    'http://www.sina.com.cn/',
    'http://www.sohu.com/',
    'http://www.163.com/'
);

$save_to = '/test.txt'; // File the fetched content is appended to
$st = fopen($save_to, "a");

$mh = curl_multi_init();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
    curl_setopt($conn[$i], CURLOPT_HEADER, 0);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, true); // Return the content as a string instead of printing it
    curl_multi_add_handle($mh, $conn[$i]);
}

do {
    curl_multi_exec($mh, $active);
} while ($active);

foreach ($urls as $i => $url) {
    $data = curl_multi_getcontent($conn[$i]); // Get the fetched content as a string
    fwrite($st, $data); // Write the string to the file; it could just as well go into a database
} // Collect the data and write it out

foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
}

curl_multi_close($mh);
fclose($st);
?>
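Neither listing checks whether a transfer actually succeeded; a timed-out or 404'd page is written to the file like any other. A sketch of per-URL error checking (assuming the same $urls, $conn, and $st from Code 2, replacing its write loop) could look like this:

```php
<?php
// After the curl_multi_exec loop finishes, inspect each easy handle
// before writing its content out.
foreach ($urls as $i => $url) {
    if (curl_errno($conn[$i]) !== 0) {
        // Transfer-level failure: DNS error, timeout, refused connection, etc.
        echo "FAILED $url: " . curl_error($conn[$i]) . "\n";
        continue;
    }
    $code = curl_getinfo($conn[$i], CURLINFO_HTTP_CODE);
    if ($code !== 200) {
        // Server responded, but not with the page we wanted
        echo "HTTP $code from $url, skipping\n";
        continue;
    }
    fwrite($st, curl_multi_getcontent($conn[$i]));
}
?>
```

This keeps bad responses out of the output file and reports them instead.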
