thinkphp实现自动采集功能的三种方法:
方法一:QueryList
个人感觉比较好用,采集详情比较不错的选择,但是采集复杂一点的列表,不好用。具体使用:
控制器示例:
public function index(){ // 使用采集类 // 使用手册 :http://www.php.cn/php/php-QueryList3-ThinkPHP.html import('Org.QL.QueryList'); $url = "http://www.zyctd.com/gqqg/"; $reg = array(); $reg['title'] = array('.sulist_title','text'); $reg['shuliang'] = array('.su_li1','html'); $obj = new \QueryList($url,$reg); $data = $obj->jsonArr; // foreach($data as $v){ // echo "<br>".$v['title'].'___'.$v['shuliang']."<br>"; // } p($data); }
相关推荐:《ThinkPHP教程》
方法二:simple_html_dom
这个方法比较适合采集一点结构简单的页面,HTML标签的类名比较明确的页面,还不错。具体使用:
控制器示例:
public function index(){ // 参考文档:http://microphp.us/plugins/public/microphp_res/simple_html_dom/manual.htm#section_quickstart // 下载地址:https://github.com/samacs/simple_html_dom/edit/master/simple_html_dom.php // 使用方法:http://www.thinkphp.cn/topic/21635.html import("Org.Util.simple_html_dom", '', '.php'); $html = file_get_html('http://www.zyctd.com/gqqg/'); $ret = $html->find('.supply_list_box ul',0)->first_child(); foreach($ret as $v){ echo $v; }; }
方法三:获取页面HTMl,进行正则匹配采集
举例一个Demo:
采集一个页面:
http://www.zyctd.com/gqqg/
我要获取上面的四个信息:标题,数量,时间,跳转链接。
获取这些信息,通过上面两种方法都采集不到,最后才选用的正则来采集。具体方法:
public function index(){ $url = "http://www.zyctd.com/gqqg/"; // http://www.zyctd.com/gqqg-p1.html $supplyDB = M('supply'); $urlList = array(); $array = array(); for($x=1; $x<=1; $x++) { array_push($urlList,"http://www.zyctd.com/gqqg-p".$x.".html"); }; foreach($urlList as $v){ $curPageList = $this->getInfo($v); array_push($array,$curPageList); }; foreach($array as $v){ foreach($v as $vv){ //echo $vv['title']."__".$vv['weight']."__".$vv['time']."<br>"; $data = array(); $data['title'] = $vv['title']; $data['weight'] = $vv['weight']; $data['add_time'] = $vv['add_time']; $data['url'] = $vv['url']; //$res = $supplyDB->add($data); //echo $res; echo "<p><span style='display:inline-block; width:110px;'>".$vv['title']."</span> <span style='display:inline-block; width:110px;'>".$vv['weight']."</span> <span style='display:inline-block; width:110px;'>".$vv['add_time']."</span> <span style='display:inline-block; width:110px;'>".$vv['url']."</span></p>"; } } // 获取信息 //$curPageList = $this->getInfo($html); //p($curPageList); } private function getInfo($url){ $html = $this->getHtml($url); $array = array(); // 匹配所有的标题 preg_match_all("#<divclass=\"sulist_title\"><i></i><span>(.*?)</span></div>#",$html,$matches); $all_title = $matches[1]; preg_match_all("#<i>发布时间:</i><span>(.*?)</span>#",$html,$matches); // 匹配所有的发布时间 $all_time = $matches[1]; // 匹配所有的求购数量 preg_match_all("#<i>求购数量:</i><span>(.*?)</span>#",$html,$matches); $all_weight = $matches[1]; // 匹配跳转链接 preg_match_all("#<atarget=\"_blank\"href=\"(.*?)\">#",$html,$matches); $all_url = $matches[1]; // 组合 foreach($all_title as $k => $v){ $arr = array(); $arr['title'] = $v; $arr['weight'] = $all_weight[$k]; $arr['add_time'] = $all_time[$k]; $arr['url'] = $all_url[$k]; array_push($array,$arr); } return $array; } private function getHtml($url){ $html = file_get_contents($url); $html = preg_replace("#\n#","",$html); $html = preg_replace("#\r#","",$html); $html = preg_replace("#\\s#","",$html); return $html; }
The above is the detailed content of How to implement thinkphp automatic collection. For more information, please follow other related articles on the PHP Chinese website!

The article discusses ThinkPHP's built-in testing framework, highlighting its key features like unit and integration testing, and how it enhances application reliability through early bug detection and improved code quality.

Article discusses using ThinkPHP for real-time stock market data feeds, focusing on setup, data accuracy, optimization, and security measures.

The article discusses key considerations for using ThinkPHP in serverless architectures, focusing on performance optimization, stateless design, and security. It highlights benefits like cost efficiency and scalability, but also addresses challenges

The article discusses implementing service discovery and load balancing in ThinkPHP microservices, focusing on setup, best practices, integration methods, and recommended tools.[159 characters]

ThinkPHP's IoC container offers advanced features like lazy loading, contextual binding, and method injection for efficient dependency management in PHP apps.Character count: 159

The article discusses using ThinkPHP to build real-time collaboration tools, focusing on setup, WebSocket integration, and security best practices.

ThinkPHP benefits SaaS apps with its lightweight design, MVC architecture, and extensibility. It enhances scalability, speeds development, and improves security through various features.

The article outlines building a distributed task queue system using ThinkPHP and RabbitMQ, focusing on installation, configuration, task management, and scalability. Key issues include ensuring high availability, avoiding common pitfalls like imprope


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Zend Studio 13.0.1
Powerful PHP integrated development environment

SublimeText3 English version
Recommended: Win version, supports code prompts!

Dreamweaver CS6
Visual web development tools

MantisBT
Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

VSCode Windows 64-bit Download
A free and powerful IDE editor launched by Microsoft