thinkphp实现自动采集功能的三种方法:
方法一:QueryList
个人感觉比较好用,采集详情比较不错的选择,但是采集复杂一点的列表,不好用。具体使用:
控制器示例:
public function index(){ // 使用采集类 // 使用手册 :http://www.php.cn/php/php-QueryList3-ThinkPHP.html import('Org.QL.QueryList'); $url = "http://www.zyctd.com/gqqg/"; $reg = array(); $reg['title'] = array('.sulist_title','text'); $reg['shuliang'] = array('.su_li1','html'); $obj = new \QueryList($url,$reg); $data = $obj->jsonArr; // foreach($data as $v){ // echo "<br>".$v['title'].'___'.$v['shuliang']."<br>"; // } p($data); }
相关推荐:《ThinkPHP教程》
方法二:simple_html_dom
这个方法比较适合采集一点结构简单的页面,HTML标签的类名比较明确的页面,还不错。具体使用:
控制器示例:
public function index(){ // 参考文档:http://microphp.us/plugins/public/microphp_res/simple_html_dom/manual.htm#section_quickstart // 下载地址:https://github.com/samacs/simple_html_dom/edit/master/simple_html_dom.php // 使用方法:http://www.thinkphp.cn/topic/21635.html import("Org.Util.simple_html_dom", '', '.php'); $html = file_get_html('http://www.zyctd.com/gqqg/'); $ret = $html->find('.supply_list_box ul',0)->first_child(); foreach($ret as $v){ echo $v; }; }
方法三:获取页面HTMl,进行正则匹配采集
举例一个Demo:
采集一个页面:
http://www.zyctd.com/gqqg/
我要获取上面的四个信息:标题,数量,时间,跳转链接。
获取这些信息,通过上面两种方法都采集不到,最后才选用的正则来采集。具体方法:
public function index(){ $url = "http://www.zyctd.com/gqqg/"; // http://www.zyctd.com/gqqg-p1.html $supplyDB = M('supply'); $urlList = array(); $array = array(); for($x=1; $x<=1; $x++) { array_push($urlList,"http://www.zyctd.com/gqqg-p".$x.".html"); }; foreach($urlList as $v){ $curPageList = $this->getInfo($v); array_push($array,$curPageList); }; foreach($array as $v){ foreach($v as $vv){ //echo $vv['title']."__".$vv['weight']."__".$vv['time']."<br>"; $data = array(); $data['title'] = $vv['title']; $data['weight'] = $vv['weight']; $data['add_time'] = $vv['add_time']; $data['url'] = $vv['url']; //$res = $supplyDB->add($data); //echo $res; echo "<p><span style='display:inline-block; width:110px;'>".$vv['title']."</span> <span style='display:inline-block; width:110px;'>".$vv['weight']."</span> <span style='display:inline-block; width:110px;'>".$vv['add_time']."</span> <span style='display:inline-block; width:110px;'>".$vv['url']."</span></p>"; } } // 获取信息 //$curPageList = $this->getInfo($html); //p($curPageList); } private function getInfo($url){ $html = $this->getHtml($url); $array = array(); // 匹配所有的标题 preg_match_all("#<divclass=\"sulist_title\"><i></i><span>(.*?)</span></div>#",$html,$matches); $all_title = $matches[1]; preg_match_all("#<i>发布时间:</i><span>(.*?)</span>#",$html,$matches); // 匹配所有的发布时间 $all_time = $matches[1]; // 匹配所有的求购数量 preg_match_all("#<i>求购数量:</i><span>(.*?)</span>#",$html,$matches); $all_weight = $matches[1]; // 匹配跳转链接 preg_match_all("#<atarget=\"_blank\"href=\"(.*?)\">#",$html,$matches); $all_url = $matches[1]; // 组合 foreach($all_title as $k => $v){ $arr = array(); $arr['title'] = $v; $arr['weight'] = $all_weight[$k]; $arr['add_time'] = $all_time[$k]; $arr['url'] = $all_url[$k]; array_push($array,$arr); } return $array; } private function getHtml($url){ $html = file_get_contents($url); $html = preg_replace("#\n#","",$html); $html = preg_replace("#\r#","",$html); $html = preg_replace("#\\s#","",$html); return $html; }
The above is the detailed content of How to implement thinkphp automatic collection. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SublimeText3 English version
Recommended: Win version, supports code prompts!

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

SecLists
SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

Notepad++7.3.1
Easy-to-use and free code editor

PhpStorm Mac version
The latest (2018.2.1) professional PHP integrated development tool
