
PHP uses CURL to implement multi-threaded web crawling

WBOY (Original) · 2016-07-13


PHP's curl functions handle all kinds of file-transfer tasks, such as simulating a browser to send GET and POST requests. Because the PHP language itself does not support multi-threading, a crawler written with a single curl handle fetches pages one at a time and is not very efficient. This is where the curl_multi functions come in: they let one script fetch several URLs concurrently. Since the curl_multi functions are that powerful, can they be used to write a concurrent, multi-"threaded" file downloader? Of course they can. My code is given below:
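For comparison, here is what a single blocking request looks like with the ordinary curl functions. This is a minimal sketch; the URL is just one of the example sites used below:

```php
<?php
// Baseline: fetch one URL with a single curl handle (blocking).
$ch = curl_init('http://www.163.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 60);
$html = curl_exec($ch);
if ($html === false) {
    echo 'curl error: ' . curl_error($ch) . "\n";
}
curl_close($ch);
?>
```

Fetching N pages this way takes roughly the sum of all N response times; the curl_multi approach below overlaps them.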

Code 1: Write the fetched content directly to a file


<?php
$urls = array(
    'http://www.sina.com.cn/',
    'http://www.sohu.com/',
    'http://www.163.com/'
); // URLs of the pages to crawl

$save_to = '/test.txt'; // file the fetched HTML is appended to
$st = fopen($save_to, "a");
$mh = curl_multi_init();

foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
    curl_setopt($conn[$i], CURLOPT_HEADER, 0);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    curl_setopt($conn[$i], CURLOPT_FILE, $st); // write the fetched HTML straight to the file
    curl_multi_add_handle($mh, $conn[$i]);
} // initialization

do {
    curl_multi_exec($mh, $active);
} while ($active); // execution

foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
} // cleanup

curl_multi_close($mh);
fclose($st);
?>
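One caveat about the `do { curl_multi_exec(...); } while ($active);` loop above: it spins the CPU while waiting for the network. A common refinement is to block on curl_multi_select() until a socket has activity. The following is a sketch of that pattern, assuming `$mh` has already been set up with its handles exactly as in Code 1:

```php
<?php
// Less CPU-hungry version of the "execution" step: wait for socket activity
// with curl_multi_select() instead of busy-looping on curl_multi_exec().
$active = null;
do {
    $status = curl_multi_exec($mh, $active);
} while ($status === CURLM_CALL_MULTI_PERFORM);

while ($active && $status === CURLM_OK) {
    // Block for up to 1 second waiting for any transfer to have data.
    if (curl_multi_select($mh, 1.0) === -1) {
        usleep(100000); // some curl builds return -1 immediately; back off briefly
    }
    do {
        $status = curl_multi_exec($mh, $active);
    } while ($status === CURLM_CALL_MULTI_PERFORM);
}
?>
```

The overall result is the same as the simple loop; it just avoids burning CPU time while the downloads are in flight.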


Code 2: Store the fetched content in a variable first, then write it to a file

<?php
$urls = array(
    'http://www.sina.com.cn/',
    'http://www.sohu.com/',
    'http://www.163.com/'
);

$save_to = '/test.txt'; // file the fetched HTML is appended to
$st = fopen($save_to, "a");
$mh = curl_multi_init();

foreach ($urls as $i => $url) {
    $conn[$i] = curl_init($url);
    curl_setopt($conn[$i], CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)");
    curl_setopt($conn[$i], CURLOPT_HEADER, 0);
    curl_setopt($conn[$i], CURLOPT_CONNECTTIMEOUT, 60);
    curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, true); // return the fetched HTML as a string instead of writing it to output
    curl_multi_add_handle($mh, $conn[$i]);
}

do {
    curl_multi_exec($mh, $active);
} while ($active);

foreach ($urls as $i => $url) {
    $data = curl_multi_getcontent($conn[$i]); // get the fetched HTML as a string
    fwrite($st, $data); // write the string to the file; you could just as well store it in a database instead
} // collect the data and write it out

foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh, $conn[$i]);
    curl_close($conn[$i]);
}

curl_multi_close($mh);
fclose($st);
?>
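Neither listing checks whether the individual transfers actually succeeded. If you want that, you can inspect each handle after the loop finishes and before the handles are closed. A sketch, assuming `$urls` and `$conn` exist exactly as in Code 2:

```php
<?php
// Per-handle error check, run after the transfers finish and before curl_close().
foreach ($urls as $i => $url) {
    if (curl_errno($conn[$i]) !== 0) {
        echo "Failed: $url (" . curl_error($conn[$i]) . ")\n";
        continue;
    }
    $code = curl_getinfo($conn[$i], CURLINFO_HTTP_CODE); // HTTP status, e.g. 200
    echo "OK: $url -> HTTP $code\n";
}
?>
```

This way a timed-out or refused connection is reported instead of silently producing an empty (or partial) entry in the output file.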

The above is the entire content of this article; I hope you find it useful.