How to implement automatic scraping in ThinkPHP

爱喝马黛茶的安东尼 | Original | 2019-08-22 09:46:16 | 4116 views

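Three ways to implement automatic scraping in ThinkPHP:

Method 1: QueryList

I find it fairly pleasant to use and a good choice for scraping detail pages, but it is awkward for more complex list pages. Usage: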

Controller example:

public function index(){
    // Use the QueryList scraping class.
    // Manual: http://www.php.cn/php/php-QueryList3-ThinkPHP.html
    import('Org.QL.QueryList');
    $url = "http://www.zyctd.com/gqqg/";
    // Each rule maps a field name to array(selector, what to extract).
    $reg = array();
    $reg['title'] = array('.sulist_title','text');
    // "shuliang" is the quantity field.
    $reg['shuliang'] = array('.su_li1','html');
    $obj = new \QueryList($url,$reg);
    $data = $obj->jsonArr;
    // foreach($data as $v){
    //     echo "<br>".$v['title'].'___'.$v['shuliang']."<br>";
    // }
    // p() is assumed to be a project-defined debug helper that prints the array.
    p($data);
}

Method 2: simple_html_dom

This method suits pages with a simple structure and clearly named HTML classes, where it works quite well. Usage:

Controller example:

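public function index(){
    // Reference manual: http://microphp.us/plugins/public/microphp_res/simple_html_dom/manual.htm#section_quickstart
    // Download: https://github.com/samacs/simple_html_dom/edit/master/simple_html_dom.php
    // Usage notes: http://www.thinkphp.cn/topic/21635.html
    import("Org.Util.simple_html_dom", '', '.php');
    $html = file_get_html('http://www.zyctd.com/gqqg/');
    // First <ul> inside .supply_list_box; with an index, find() returns one node or null.
    $list = $html ? $html->find('.supply_list_box ul',0) : null;
    if ($list) {
        // Iterate over the child elements of the list and print each one.
        // (Looping over a single node, as first_child() returns, would walk its properties instead.)
        foreach($list->children() as $v){
            echo $v;
        }
    }
}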

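Method 3: Fetch the page HTML and scrape it with regular expressions

Here is a demo.

The page to scrape:

http://www.zyctd.com/gqqg/

I want four pieces of information from it: the title, the quantity, the time, and the link.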

Neither of the first two methods could retrieve this information, so I ended up using regular expressions. The implementation:

public function index(){
    // Listing pages follow the pattern http://www.zyctd.com/gqqg-p1.html
    $supplyDB = M('supply');
    $urlList = array();
    $array = array();
    // Build the list of page URLs; raise the upper bound to scrape more pages.
    for($x=1; $x<=1; $x++) {
        array_push($urlList,"http://www.zyctd.com/gqqg-p".$x.".html");
    }
    // Scrape each page and collect its rows.
    foreach($urlList as $v){
        $curPageList = $this->getInfo($v);
        array_push($array,$curPageList);
    }
    foreach($array as $v){
        foreach($v as $vv){
            //echo $vv['title']."__".$vv['weight']."__".$vv['add_time']."<br>";
            $data = array();
            $data['title'] = $vv['title'];
            $data['weight'] = $vv['weight'];
            $data['add_time'] = $vv['add_time'];
            $data['url'] = $vv['url'];
            // Uncomment to save each row to the "supply" table.
            //$res = $supplyDB->add($data);
            //echo $res;
            echo "<p><span style='display:inline-block; width:110px;'>".$vv['title']."</span>
            <span style='display:inline-block; width:110px;'>".$vv['weight']."</span>
            <span style='display:inline-block; width:110px;'>".$vv['add_time']."</span>
            <span style='display:inline-block; width:110px;'>".$vv['url']."</span></p>";
        }
    }
    // Debugging: fetch and dump a single page.
    //$curPageList = $this->getInfo($html);
    //p($curPageList);
}
private function getInfo($url){
    // getHtml() strips all whitespace, which is why the patterns below
    // contain no spaces (e.g. "<divclass=" rather than "<div class=").
    $html = $this->getHtml($url);
    $array = array();
    // Match all titles.
    preg_match_all("#<divclass=\"sulist_title\"><i></i><span>(.*?)</span></div>#",$html,$matches);
    $all_title = $matches[1];
    // Match all publication times (the page labels them "发布时间：").
    preg_match_all("#<i>发布时间：</i><span>(.*?)</span>#",$html,$matches);
    $all_time = $matches[1];
    // Match all requested quantities (the page labels them "求购数量：").
    preg_match_all("#<i>求购数量：</i><span>(.*?)</span>#",$html,$matches);
    $all_weight = $matches[1];
    // Match the links.
    preg_match_all("#<atarget=\"_blank\"href=\"(.*?)\">#",$html,$matches);
    $all_url = $matches[1];
    // Combine the four lists into one row per item, joined by index.
    foreach($all_title as $k => $v){
        $arr = array();
        $arr['title'] = $v;
        $arr['weight'] = $all_weight[$k];
        $arr['add_time'] = $all_time[$k];
        $arr['url'] = $all_url[$k];
        array_push($array,$arr);
    }
    return $array;
}
private function getHtml($url){
    $html = file_get_contents($url);
    // Strip all whitespace (spaces, tabs, \r, \n) so that the patterns
    // in getInfo() can match tags that sit back to back.
    $html = preg_replace("#\\s#","",$html);
    return $html;
}
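
To try the regular-expression approach without fetching the live page, here is a minimal standalone sketch. The sample markup is hypothetical, modelled only on the patterns used in getInfo(), so the real page structure may differ. The captureAll() helper is likewise introduced here purely for illustration. The `u` modifier is added on the assumption that the input is UTF-8.

```php
<?php
// Hypothetical sample markup, modelled on the patterns in getInfo().
$sample = <<<'HTML'
<div class="sulist_title"><i></i><span>Item A</span></div>
<a target="_blank" href="/detail/1.html">
  <i>求购数量：</i><span>100 kg</span>
  <i>发布时间：</i><span>2019-08-20</span>
</a>
<div class="sulist_title"><i></i><span>Item B</span></div>
<a target="_blank" href="/detail/2.html">
  <i>求购数量：</i><span>50 kg</span>
  <i>发布时间：</i><span>2019-08-21</span>
</a>
HTML;

// Same idea as getHtml(): remove every whitespace character so the
// patterns do not have to account for indentation or line breaks.
$html = preg_replace('#\s#u', '', $sample);

// Illustrative helper: return the first capture group of every match.
function captureAll(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

$titles  = captureAll('#<divclass="sulist_title"><i></i><span>(.*?)</span></div>#u', $html);
$weights = captureAll('#<i>求购数量：</i><span>(.*?)</span>#u', $html);
$times   = captureAll('#<i>发布时间：</i><span>(.*?)</span>#u', $html);
$urls    = captureAll('#<atarget="_blank"href="(.*?)">#u', $html);

// Join the four lists by index; a missing entry becomes an empty string.
$rows = [];
foreach ($titles as $k => $title) {
    $rows[] = [
        'title'    => $title,
        'weight'   => $weights[$k] ?? '',
        'add_time' => $times[$k] ?? '',
        'url'      => $urls[$k] ?? '',
    ];
}

print_r($rows);
```

Stripping whitespace also removes spaces inside the captured values ("Item A" comes back as "ItemA"). Joining by index assumes that every item on the page has all four fields in the same order; if one is missing, the later rows will be misaligned.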
