The solution to garbled web pages when fetching with curl and file_get_contents
Today, when I used curl_init to crawl Sohu's web pages, the fetched pages came back garbled. After some analysis, I found that the server had gzip compression turned on. Adding the CURLOPT_ENCODING option with curl_setopt makes curl decompress the gzip response, and the page then decodes correctly.
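To see why a compressed response looks garbled, and as a fallback for a body that was fetched without CURLOPT_ENCODING, the body can be checked for the gzip magic bytes and decoded by hand. This is only a minimal sketch, assuming the zlib extension is loaded; `decode_if_gzipped` is a hypothetical helper name, not part of PHP or curl.

```php
<?php
// Hypothetical helper: return the decoded body if it is gzip data,
// otherwise return it unchanged.
function decode_if_gzipped(string $body): string
{
    // Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }
    $decoded = gzdecode($body);
    // gzdecode() returns false on corrupt data; fall back to the raw bytes.
    return $decoded === false ? $body : $decoded;
}

// Simulate a compressed response locally instead of contacting a real server.
$compressed = gzencode('<html>hello</html>');
echo decode_if_gzipped($compressed), "\n";
```

With curl, setting CURLOPT_ENCODING is still the simpler route, because curl then does the decoding itself.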
Also, if the fetched page is encoded in GBK while the script itself is encoded in UTF-8, the page must be converted with mb_convert_encoding.
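Converting unconditionally would garble a page that is already UTF-8, so it can help to convert only when needed. A minimal sketch, assuming the mbstring extension is loaded and that GBK is the only non-UTF-8 charset expected; `to_utf8` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert $html to UTF-8 only if it is not valid UTF-8 already.
function to_utf8(string $html, string $sourceCharset = 'GBK'): string
{
    // mb_check_encoding() returns true when the bytes form valid UTF-8.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $sourceCharset);
}

$gbkBytes = "\xD6\xD0\xCE\xC4"; // the two characters "中文" encoded as GBK
echo to_utf8($gbkBytes), "\n";
```

The validity check is a heuristic: a short GBK string could in rare cases also be valid UTF-8, so reading the charset from the response headers or the page's meta tag would be more reliable where it is available.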
Using curl:
<?php
$tmp = sys_get_temp_dir();
$cookieDump = tempnam($tmp, 'cookies');
$url = 'http://tv.sohu.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 1); // Include the response headers in the returned content
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // Follow redirects automatically
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // Overall timeout in seconds
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // Connection timeout in seconds
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Encoding: gzip, deflate')); // Set the HTTP request header
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate'); // Enable gzip decoding; harmless even if the page is not gzip-compressed
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieDump); // File in which cookies are stored
$content = curl_exec($ch);
// curl_exec() returns false on failure, so stop before trying to convert
if ($content === false) {
    die('curl error: ' . curl_error($ch));
}
// Convert the fetched page from GBK to UTF-8
$content = mb_convert_encoding($content, "UTF-8", "GBK");
?>
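Because CURLOPT_HEADER is enabled in the curl snippet, `$content` holds the response headers followed by the page body. A minimal sketch of separating the two, assuming the header length comes from `curl_getinfo($ch, CURLINFO_HEADER_SIZE)` called after `curl_exec`; `split_response` is a hypothetical helper name, and the response here is a local stand-in rather than real Sohu output.

```php
<?php
// Hypothetical helper: split a raw curl response into [headers, body].
function split_response(string $raw, int $headerSize): array
{
    return [substr($raw, 0, $headerSize), substr($raw, $headerSize)];
}

// Local stand-in for a response fetched with CURLOPT_HEADER enabled.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>ok</html>";
// With a real handle this would be curl_getinfo($ch, CURLINFO_HEADER_SIZE).
$headerSize = strpos($raw, "\r\n\r\n") + 4;

[$headers, $body] = split_response($raw, $headerSize);
echo $body, "\n";
```

If the headers are not needed at all, leaving CURLOPT_HEADER off avoids the split entirely.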
Using file_get_contents:
<?php
$url = 'http://tv.sohu.com';
// Prefix the URL with the compress.zlib:// wrapper; the page is decoded even if the server has gzip compression enabled
$content = file_get_contents("compress.zlib://" . $url);
// Convert the fetched page from GBK to UTF-8
$content = mb_convert_encoding($content, "UTF-8", "GBK");
?>
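The curl version sets a 10-second timeout, whereas a bare file_get_contents call falls back to the default_socket_timeout setting. A minimal sketch of passing a timeout through a stream context; `build_http_context` is a hypothetical helper name, and the 10-second value simply mirrors the curl example.

```php
<?php
// Hypothetical helper: build an HTTP stream context with a read timeout.
function build_http_context(int $timeoutSeconds = 10)
{
    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeoutSeconds, // read timeout in seconds
        ],
    ]);
}

$context = build_http_context(10);
// Not executed here, to avoid a network call:
// $content = file_get_contents('http://tv.sohu.com', false, $context);
print_r(stream_context_get_options($context));
```

This sketch uses the plain http:// URL. If the server still returns gzip data, the body would need to be decompressed separately before the GBK conversion.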
Original text: http://woqilin.blogspot.com/2014/05/curl-filegetcontents.html