PHP uses cURL to implement multi-threaded web crawling
PHP's cURL functions can perform all kinds of file-transfer operations, such as simulating a browser to send GET and POST requests. However, because PHP itself does not support multi-threading, crawlers built on the ordinary cURL functions fetch pages one at a time and are not very efficient. This is where the cURL Multi functions come in: they allow concurrent requests to multiple URL addresses. Since the cURL Multi functions are so powerful, can you use them to write a concurrent multi-threaded file downloader? Of course you can. My code is given below:
Code 1: Write the fetched source directly to a file
<?php
$urls = array(
    'http://www.sina.com.cn/',
    'http://www.sohu.com/',
    'http://www.163.com/'
); // Set the page URLs to be crawled

$save_to = '/test.txt'; // Write the fetched source into this file
$st = fopen($save_to, "a");

$mh = curl_multi_init();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
    curl_setopt($conn[$i], CURLOPT_HEADER, 0);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    curl_setopt($conn[$i], CURLOPT_FILE, $st); // Write the fetched source to the file
    curl_multi_add_handle($mh, $conn[$i]);
} // Initialization

do {
    curl_multi_exec($mh, $active);
} while ($active); // Execute

foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
} // Clean up
curl_multi_close($mh);
fclose($st);
?>
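One caveat about Code 1: the do/while loop calls curl_multi_exec() continuously, so the script spins the CPU until every transfer finishes. A minimal sketch of an alternative (not part of the original article; the URL list and timeouts are my own choices) is to call curl_multi_select() between iterations, so the script blocks until one of the sockets actually has activity:

```php
<?php
// Sketch: same fan-out as Code 1, but the completion loop blocks on
// curl_multi_select() instead of busy-looping on curl_multi_exec().
$urls = array('http://www.sina.com.cn/', 'http://www.sohu.com/');

$mh = curl_multi_init();
$conn = array();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, true);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($conn[$i], CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $conn[$i]);
}

do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        // Block until a socket is ready rather than polling in a tight loop
        curl_multi_select($mh);
    }
} while ($active && $status == CURLM_OK);

foreach ($conn as $ch) {
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

Note that curl_multi_select() can return -1 in some edge cases (for example, before any file descriptors exist); production code may want to usleep() briefly when that happens instead of looping immediately.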
Code 2: Fetch the source into a variable first, then write it to a file
<?php
$urls = array(
    'http://www.sina.com.cn/',
    'http://www.sohu.com/',
    'http://www.163.com/'
);

$save_to = '/test.txt'; // Write the fetched source into this file
$st = fopen($save_to, "a");

$mh = curl_multi_init();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
    curl_setopt($conn[$i], CURLOPT_HEADER, 0);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, true); // Return the fetched source as a string instead of writing it to output
    curl_multi_add_handle($mh, $conn[$i]);
}

do {
    curl_multi_exec($mh, $active);
} while ($active);

foreach ($urls as $i => $url) {
    $data = curl_multi_getcontent($conn[$i]); // Get the fetched source as a string
    fwrite($st, $data); // Write the string to the file; it could instead be saved to a database
}

foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
}
curl_multi_close($mh);
fclose($st);
?>
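As the comment in Code 2 notes, the fetched strings do not have to go to a file. A minimal sketch of that idea (my own variation, not the article's code): keep the results in an array keyed by URL, the shape you would typically want before inserting rows into a database. Timeouts here are arbitrary choices.

```php
<?php
// Sketch: collect each fetched page into $results, keyed by its URL,
// instead of writing to a file.
$urls = array(
    'http://www.sina.com.cn/',
    'http://www.sohu.com/',
    'http://www.163.com/'
);

$mh = curl_multi_init();
$conn = array();
foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, true);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($conn[$i], CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $conn[$i]);
}

do {
    curl_multi_exec($mh, $active);
} while ($active);

$results = array();
foreach ($urls as $i => $url) {
    // May be empty/null if the individual transfer failed
    $results[$url] = curl_multi_getcontent($conn[$i]);
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
}
curl_multi_close($mh);
```

From here, each $results entry could be passed to an INSERT statement, fed to a parser, and so on.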