curl data collection series: the single-page collection function get_html
This is a series; it cannot be finished in one or two days, so I will publish the parts one by one.
General outline:
1. curl data collection series: single-page collection function get_html
2. curl data collection series: multi-page parallel collection function get_htmls
3. curl data collection series: regular-expression processing function get_matches
4. curl data collection series: code separation
5. curl data collection series: parallel logic control function web_spider
Single-page collection is the most commonly used step in the data collection process. Sometimes, when a server restricts access, it is the only method you can use: it is slow, but it is easy to control. So it is well worth writing a reusable curl function for it.
We are all familiar with Baidu and NetEase, so we will use collecting the homepages of these two sites as our examples.
The simplest way to write:
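The original snippet is not reproduced in this extract, so here is a minimal sketch of what such a wrapper and its simplest call might look like, assuming the get_html($url, $options) signature used below plus a couple of common defaults (CURLOPT_RETURNTRANSFER and a short timeout):

function get_html($url, $options = array()) {
    // sensible defaults; anything passed in $options overrides them
    $defaults = array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true, // return the page as a string instead of printing it
        CURLOPT_TIMEOUT => 5,           // give up after 5 seconds
    );
    $ch = curl_init();
    curl_setopt_array($ch, $options + $defaults); // caller options win on key collisions
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$url = 'http://www.163.com';
echo get_html($url);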
You can see that the http_code is 302 (redirected). At this point you need to pass some extra options:
$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url,$options);
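Incidentally, the 302 mentioned above can be confirmed by asking curl for the response code with curl_getinfo. The helper below is only for illustration; its name and defaults are not from the original article:

// Hypothetical helper that fetches a URL and returns only the HTTP status code
function get_http_code($url, $options = array()) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // do not print the body
    curl_setopt_array($ch, $options);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $code;
}

echo get_http_code('http://www.163.com'); // 302 without CURLOPT_FOLLOWLOCATION, as described above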
You may then wonder: why is the returned page different from the one we see when we visit it from our own computer?
It seems the options are still not enough for the server to work out what kind of device our client is, so it returns a plain version of the page.
It seems we also need to set the USERAGENT:
$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
echo get_html($url,$options);
OK, now the page comes out correctly. By passing options this way, the get_html function can basically cover these extended needs.
Of course, there are other ways to achieve this. When you already know exactly which NetEase page you want, you can simply collect it directly:
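The original example is not shown in this extract; one way to read "collect it directly" is that, once you know the exact page you want, you can request it without redirect handling. A sketch under that assumption (the URL below is a placeholder, not the real redirect target):

$url = 'http://www.163.com/index.html'; // placeholder for the page you already know you want
$options = array(
    // still identify ourselves as a desktop browser
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0',
);
echo get_html($url, $options); // no CURLOPT_FOLLOWLOCATION needed for a page that does not redirect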