How to implement automatic collection (web scraping) in ThinkPHP

爱喝马黛茶的安东尼

Aug 22, 2019 am 09:46 AM

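Three ways to implement automatic collection (web scraping) in ThinkPHP:

Method 1: QueryList

In my experience this one is fairly easy to use and a good choice for scraping detail pages, but it does not work well for more complex list pages. Usage: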

Controller example:

public function index(){
    // Load the scraping class
    // Manual: http://www.php.cn/php/php-QueryList3-ThinkPHP.html
    import('Org.QL.QueryList');
    $url = "http://www.zyctd.com/gqqg/";
    // Extraction rules: field name => array(CSS selector, what to extract)
    $reg = array();
    $reg['title'] = array('.sulist_title','text');
    $reg['shuliang'] = array('.su_li1','html'); // shuliang = quantity
    $obj = new \QueryList($url,$reg);
    $data = $obj->jsonArr;
    // foreach($data as $v){
    //     echo "<br>".$v['title'].'___'.$v['shuliang']."<br>";
    // }
    // p() is presumably a project-defined debug print helper; ThinkPHP's built-in dump() would also work
    p($data);
}

Method 2: simple_html_dom

This method suits pages with a simple structure and clearly named HTML classes, where it works reasonably well. Usage:

Controller example:

public function index(){
    // Reference manual: http://microphp.us/plugins/public/microphp_res/simple_html_dom/manual.htm#section_quickstart
    // Download: https://github.com/samacs/simple_html_dom/edit/master/simple_html_dom.php
    // Usage notes: http://www.thinkphp.cn/topic/21635.html
    import("Org.Util.simple_html_dom", '', '.php');
    $html = file_get_html('http://www.zyctd.com/gqqg/');
    if (!$html) {
        return; // the page could not be fetched or parsed
    }
    // First <ul> inside .supply_list_box; find() returns null when nothing matches
    $list = $html->find('.supply_list_box ul',0);
    if (!$list) {
        return;
    }
    // children() returns an array of the list's child elements
    foreach($list->children() as $v){
        echo $v;
    }
}
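
If pulling in a third-party parser is not an option, the same kind of class-based selection can be sketched with PHP's bundled DOM extension. This is a swapped-in technique, not the simple_html_dom API used above. The HTML fragment and the function name extractListItems are hypothetical; only the class name supply_list_box is taken from the example.

```php
<?php
// Hypothetical fragment standing in for the real page; only the
// class name supply_list_box comes from the example above.
$sampleHtml = <<<HTML
<div class="supply_list_box">
  <ul>
    <li>Item A</li>
    <li>Item B</li>
  </ul>
</div>
HTML;

/**
 * Return the trimmed text of every <li> directly under the first <ul>
 * inside an element whose class list contains supply_list_box.
 */
function extractListItems($html)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world markup is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog is a common way to make loadHTML() treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath has no class selector, so match the class name as a whole word.
    $query = '(//*[contains(concat(" ", normalize-space(@class), " "), " supply_list_box ")]//ul)[1]/li';

    $items = array();
    foreach ($xpath->query($query) as $li) {
        $items[] = trim($li->textContent);
    }
    return $items;
}

print_r(extractListItems($sampleHtml));
```

In a controller, the string passed to extractListItems() would be the fetched page body instead of the sample.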

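Method 3: Fetch the page HTML and extract the data with regular expressions

A demo, scraping this page:

http://www.zyctd.com/gqqg/

I want four pieces of information from each entry on it: title, quantity, time, and link.

Neither of the two methods above could extract this information, so I finally resorted to regular expressions. The code: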

public function index(){
    // Paginated list URLs look like http://www.zyctd.com/gqqg-p1.html
    $supplyDB = M('supply');
    $urlList = array();
    $array = array();
    // Build the list of page URLs (only page 1 here; raise the upper bound to scrape more pages)
    for($x=1; $x<=1; $x++) {
        array_push($urlList,"http://www.zyctd.com/gqqg-p".$x.".html");
    }
    // Scrape each page
    foreach($urlList as $v){
        $curPageList = $this->getInfo($v);
        array_push($array,$curPageList);
    }
    foreach($array as $v){
        foreach($v as $vv){
            //echo $vv['title']."__".$vv['weight']."__".$vv['add_time']."<br>";
            $data = array();
            $data['title'] = $vv['title'];
            $data['weight'] = $vv['weight'];
            $data['add_time'] = $vv['add_time'];
            $data['url'] = $vv['url'];
            // Uncomment to save the row to the supply table
            //$res = $supplyDB->add($data);
            //echo $res;
            echo "<p><span style='display:inline-block; width:110px;'>".$vv['title']."</span>
            <span style='display:inline-block; width:110px;'>".$vv['weight']."</span>
            <span style='display:inline-block; width:110px;'>".$vv['add_time']."</span>
            <span style='display:inline-block; width:110px;'>".$vv['url']."</span></p>";
        }
    }
}
private function getInfo($url){
    // getHtml() strips all whitespace, which is why the patterns below
    // contain "<divclass=" and "<atarget=" without spaces
    $html = $this->getHtml($url);
    $array = array();
    // Match all titles
    preg_match_all("#<divclass=\"sulist_title\"><i></i><span>(.*?)</span></div>#",$html,$matches);
    $all_title = $matches[1];
    // Match all publish times
    preg_match_all("#<i>发布时间：</i><span>(.*?)</span>#",$html,$matches);
    $all_time = $matches[1];
    // Match all requested quantities
    preg_match_all("#<i>求购数量：</i><span>(.*?)</span>#",$html,$matches);
    $all_weight = $matches[1];
    // Match the links
    preg_match_all("#<atarget=\"_blank\"href=\"(.*?)\">#",$html,$matches);
    $all_url = $matches[1];
    // Combine the four lists into one row per entry, keyed by position;
    // a missing value becomes an empty string
    foreach($all_title as $k => $v){
        $arr = array();
        $arr['title'] = $v;
        $arr['weight'] = isset($all_weight[$k]) ? $all_weight[$k] : '';
        $arr['add_time'] = isset($all_time[$k]) ? $all_time[$k] : '';
        $arr['url'] = isset($all_url[$k]) ? $all_url[$k] : '';
        array_push($array,$arr);
    }
    return $array;
}
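private function getHtml($url){
    $html = file_get_contents($url);
    if ($html === false) {
        return ''; // request failed; nothing to match against
    }
    // Remove all whitespace (\s already covers \r and \n) so the patterns in getInfo()
    // are not affected by line breaks or indentation. Spaces inside text values are removed too.
    return preg_replace("#\\s#","",$html);
}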
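
The regex approach can be tried without touching the network by running the same patterns against a hard-coded fragment. The sketch below is a standalone script, not part of the controller. The HTML sample, its values, and the function names stripWhitespace and extractSupplyInfo are hypothetical; the patterns are the ones used in getInfo().

```php
<?php
// Hypothetical sample modeled on the markup the patterns in getInfo() expect.
$sample = <<<HTML
<a target="_blank" href="/detail/1.html">
  <div class="sulist_title"><i></i><span>Sample title</span></div>
  <i>发布时间：</i><span>2019-08-22</span>
  <i>求购数量：</i><span>100kg</span>
</a>
HTML;

// Same normalization as getHtml(): drop every whitespace character.
function stripWhitespace($html)
{
    return preg_replace('#\s+#', '', $html);
}

// Run one pattern per field, then zip the results into rows by position.
function extractSupplyInfo($html)
{
    $patterns = array(
        'title'    => '#<divclass="sulist_title"><i></i><span>(.*?)</span></div>#',
        'add_time' => '#<i>发布时间：</i><span>(.*?)</span>#',
        'weight'   => '#<i>求购数量：</i><span>(.*?)</span>#',
        'url'      => '#<atarget="_blank"href="(.*?)">#',
    );

    $columns = array();
    foreach ($patterns as $name => $pattern) {
        preg_match_all($pattern, $html, $matches);
        $columns[$name] = $matches[1];
    }

    $rows = array();
    foreach ($columns['title'] as $k => $title) {
        $row = array();
        foreach ($columns as $name => $values) {
            // A field with fewer matches than titles yields an empty string.
            $row[$name] = isset($values[$k]) ? $values[$k] : '';
        }
        $rows[] = $row;
    }
    return $rows;
}

print_r(extractSupplyInfo(stripWhitespace($sample)));
```

Because every whitespace character is removed before matching, the space inside "Sample title" is lost as well and the extracted title is "Sampletitle". Matching by position also assumes each entry contains all four fields; if one entry lacks a field, later rows shift out of alignment.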
