
The solution to curl and file_get_contents grabbing garbled web pages

WBOY (original)
2016-07-29 08:52:42

Today, when I used curl_init to crawl Sohu's web pages, the fetched pages came back garbled. After some analysis, I found that the server had gzip compression enabled. Setting the CURLOPT_ENCODING option with curl_setopt tells cURL to decode gzip, and the page then comes back readable.
Also, if the fetched page is encoded in GBK while the script itself is encoded in UTF-8, the fetched page must be converted with mb_convert_encoding.
<?php
$tmp = sys_get_temp_dir();
$cookieDump = tempnam($tmp, 'cookies');
$url = 'http://tv.sohu.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 1); // Include the response headers in the returned content
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // Follow redirects automatically
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // Overall timeout, in seconds
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // Connection timeout, in seconds
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Encoding: gzip, deflate')); // Set the HTTP request header
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate'); // Decode gzip/deflate responses; harmless if the page is not compressed
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieDump); // File in which cookies are stored
$content = curl_exec($ch);
if ($content === false) {
    exit('cURL error: ' . curl_error($ch)); // Stop here if the request failed
}
curl_close($ch);
// Convert the fetched page from GBK to UTF-8
$content = mb_convert_encoding($content, "UTF-8", "GBK");
?>
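The conversion above assumes the page is always GBK. As a minimal sketch, assuming the page is either already UTF-8 or in one known legacy encoding, the conversion can be skipped when the bytes are already valid UTF-8. The function name to_utf8 and its $fallback parameter are hypothetical and not part of the original script. The check is only a heuristic, since a GBK byte sequence can occasionally also be valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original script): convert $html to UTF-8
// only when it is not already valid UTF-8. $fallback is the assumed source
// encoding of pages that fail the UTF-8 check.
function to_utf8(string $html, string $fallback = 'GBK'): string
{
    // Already valid UTF-8: converting again would corrupt the text.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    // Otherwise treat the bytes as the fallback encoding and convert.
    return mb_convert_encoding($html, 'UTF-8', $fallback);
}

$gbkBytes = "\xD6\xD0\xCE\xC4"; // the two characters "中文" encoded in GBK
echo to_utf8($gbkBytes), PHP_EOL;
```

In the script above, this would replace the unconditional mb_convert_encoding call, so that a page that already arrives in UTF-8 is left untouched.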
With file_get_contents, the compress.zlib:// wrapper does the same job:
<?php
$url = 'http://tv.sohu.com';
// Prefix the URL with the compress.zlib:// wrapper; the response is decoded even if the server has gzip compression enabled
$content = file_get_contents("compress.zlib://" . $url);
// Convert the fetched page from GBK to UTF-8
$content = mb_convert_encoding($content, "UTF-8", "GBK");
?>
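The garbling described at the top is what a gzip-compressed body looks like when it is used without decoding. For a body that has already been fetched, a rough sketch of the same idea is to check for the gzip magic bytes and decode with gzdecode from the zlib extension. The helper name maybe_gunzip is hypothetical, and the sketch assumes the body is either plain text or a single gzip stream.

```php
<?php
// Hypothetical helper (not from the original script): decode $body only if
// it looks like gzip data, otherwise return it unchanged.
function maybe_gunzip(string $body): string
{
    // Gzip streams begin with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }
    // gzdecode returns false on corrupt input; fall back to the raw body.
    $decoded = @gzdecode($body);
    return $decoded === false ? $body : $decoded;
}

// Simulate a compressed response locally instead of hitting the network.
$compressed = gzencode('<html>hello</html>');
echo maybe_gunzip($compressed), PHP_EOL;
echo maybe_gunzip('<html>plain</html>'), PHP_EOL;
```

This is only a fallback for cases where neither CURLOPT_ENCODING nor the compress.zlib:// wrapper was used when fetching.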
Original text: http://woqilin.blogspot.com/2014/05/curl- filegetcontents.html

