Web crawling - php uses curl to crawl web pages

WBOY
2016-09-23 11:31:06

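<code>I wrapped curl in a function for fetching web pages. It works fine in local testing. On the test server, when it is run through a browser request, most of the time the function returns HTTP status code 0 with the error message `Error: name lookup timed out`, and only very rarely returns 200. But when it is run directly from the command line on the same test server, it succeeds 100% of the time.

The code is as follows:</code>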
<code>static public function curlGet($url, $data = array(), $header = array(), $timeout = 3, $port = 80)
    {
        $is_ssl  = substr($url, 0, 5) == 'https' ? 1 : 0;

        $ch = curl_init();
        if (!empty($data)) {
            $data = is_array($data)?http_build_query($data): $data;
            $url .= (strpos($url, '?') !== false ? '&' : '?') . $data;
        }

        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects and fetch the final page
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_POST, 0);
        //curl_setopt($ch, CURLOPT_PORT, $port);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
        curl_setopt($ch, CURLOPT_REFERER, $url);
        // curl_setopt($ch, CURLOPT_USERAGENT, self::url2useragent($url));
        
        if($is_ssl){
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // skip certificate verification (insecure, test use only)
            curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);     // check that the certificate's host name matches the requested host
        }
        $result = array();
        $result['result'] = curl_exec($ch);
        $result['http_code'] = curl_getinfo($ch,CURLINFO_HTTP_CODE);
        if (0 != curl_errno($ch)) {
            $result['error']  = "Error:\n" . curl_error($ch);
        }
        curl_close($ch);
        return $result;
    }</code>
<code>My own feeling is that this has little to do with the code, but I don't know where the problem lies. Any pointers would be greatly appreciated.</code>
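Because the reported error is a name lookup timeout, it can help to see how the request's time budget is spent. Note that `CURLOPT_TIMEOUT` (3 seconds by default in `curlGet`) covers the whole request, DNS resolution included. The sketch below is one possible way to split curl's cumulative timers into phases. `curlPhaseTimes` is a hypothetical helper, not part of the original post, and it assumes the array returned by `curl_getinfo($ch)` when called without an option argument.

```php
<?php
/**
 * Hypothetical helper (not from the original post): splits curl's
 * cumulative timers into per-phase durations, in seconds.
 *
 * $info is assumed to be the array returned by curl_getinfo($ch)
 * when no option argument is passed.
 */
function curlPhaseTimes(array $info): array
{
    // curl reports cumulative times measured from the start of the request.
    $dns     = (float) ($info['namelookup_time'] ?? 0.0);
    $connect = (float) ($info['connect_time'] ?? 0.0);
    $total   = (float) ($info['total_time'] ?? 0.0);

    return [
        'dns'     => $dns,                                   // time until the name was resolved
        'connect' => max(0.0, $connect - $dns),              // TCP connect after DNS finished
        'rest'    => max(0.0, $total - max($connect, $dns)), // everything after the connect phase
        'total'   => $total,
    ];
}
```

Inside `curlGet`, it could be called as `$result['timing'] = curlPhaseTimes(curl_getinfo($ch));` before `curl_close($ch)`. A timer that stays at 0 usually means that phase never completed, in which case the elapsed time shows up under `rest`.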

Reply content:

1. Set a larger timeout.
2. Restart the server.
3. Check whether DNS resolution on the server is working normally.
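The DNS check in point 3 can be sketched roughly as follows. Running the same script once from the command line and once through the web server may show whether the two environments resolve names differently, which would match the symptom in the question. `timeDnsLookup` is a hypothetical helper. It relies on `gethostbyname`, which is IPv4-only and goes through the system resolver, so it may not behave exactly like the resolver libcurl uses.

```php
<?php
/**
 * Hypothetical diagnostic helper: times a DNS lookup for the host in $url
 * and records which PHP SAPI (e.g. "cli" or "fpm-fcgi") ran it.
 */
function timeDnsLookup(string $url): array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        // No host component could be parsed from the input.
        return ['sapi' => PHP_SAPI, 'host' => null, 'ip' => null,
                'resolved' => false, 'seconds' => 0.0];
    }

    $start   = microtime(true);
    $ip      = gethostbyname($host); // returns the host name unchanged on failure
    $elapsed = microtime(true) - $start;

    return [
        'sapi'     => PHP_SAPI,
        'host'     => $host,
        'ip'       => $ip,
        // A successful lookup yields an IP address rather than the original name.
        'resolved' => filter_var($ip, FILTER_VALIDATE_IP) !== false,
        'seconds'  => $elapsed,
    ];
}

// Example call (the URL is a placeholder, substitute the page being crawled):
// var_dump(timeDnsLookup('https://example.com/'));
```

If the lookup is fast under `cli` but slow or failing under the web server's SAPI, the difference is likely in the environment (resolver configuration, permissions, or network policy for the web server process) rather than in `curlGet` itself.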

Your server is acting up.
