Implementing page collection (a crawler) with ThinkPHP5 and QueryList
What is QueryList?
QueryList is a PHP toolkit for content collection (scraping). It follows modern development practices and offers a simple, elegant syntax and strong extensibility. Where traditional collection relies on obscure regular expressions, QueryList uses CSS selectors, which are more powerful and far more readable. This greatly lowers the barrier to collecting content with PHP and makes collection code easy to read and maintain, so you can say goodbye to hard-to-maintain regular expressions.
QueryList provides a complete set of content collection solutions
● DOM content selection: CSS selector
● HTTP client: GuzzleHTTP
● Content filtering: CSS selector
● Fixing garbled characters: several built-in encoding solutions
● Additional features: Rich extension plug-ins
Premise
The project is built on the ThinkPHP5 framework and relies mainly on two files, `QueryList.php` and `phpQuery.php`. Switch to the project directory, create a new directory named QL under extend, and then run the following Composer command inside the QL directory to install QueryList:
composer require jaeger/querylist
Then add `use QL\QueryList;` to the controller that needs it and write the collection code in that controller. The example below calls the static `QueryList::Query()` method, which belongs to the QueryList 3 API:
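// Target page to collect
$page = 'http://cms.querylist.cc/news/566.html';

// Collection rules
$reg = array(
    // Article title
    'title' => array('h1', 'text'),

    // Publication date; QueryList's filter removes the span and a tags
    'date' => array('.pt_info', 'text', '-span -a', function ($content) {
        // Use a callback to narrow the remaining text down to the date
        $arr = explode(' ', $content);
        return $arr[0];
    }),

    // Article body; the filter strips hyperlinks but keeps their text,
    // and removes the copyright notice, JS code, and other useless content
    'content' => array('.post_content', 'html', 'a -.content_copyright -script', function ($content) {
        // Use a callback to download the article's images and replace
        // their paths with local paths.
        // Make sure an image folder exists in the current directory and is writable.
        // QueryList is based on phpQuery, so phpQuery can be used anywhere;
        // a regex or any other approach would work here as well.
        $doc = \phpQuery::newDocumentHTML($content);
        $imgs = pq($doc)->find('img');
        foreach ($imgs as $img) {
            $src = 'http://cms.querylist.cc' . pq($img)->attr('src');
            // Save into the image folder mentioned above
            $localSrc = 'image/' . md5($src) . '.jpg';
            $stream = file_get_contents($src);
            file_put_contents($localSrc, $stream);
            pq($img)->attr('src', $localSrc);
        }
        return $doc->htmlOuter();
    })
);

// Region selector: the rules are applied inside each matching element
$rang = '.content';
$ql = QueryList::Query($page, $reg, $rang);
$data = $ql->getData();

// Print the result
print_r($data);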
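To make the 'date' rule easier to follow, here is a minimal offline sketch of what it does, written with PHP's built-in DOMDocument and DOMXPath as a stand-in for QueryList (so it uses an XPath query rather than a CSS selector). The sample markup, the function names `extractDate` and `localImageName`, and the `trim()` call are assumptions made for illustration; they are not part of QueryList or of the target page.

```php
<?php
// Hypothetical markup resembling a ".pt_info" block; the real page may differ.
$html = '<div class="pt_info">2016-05-20 10:30 <span>Views: 12</span> <a href="#">Edit</a></div>';

/**
 * Stand-in for the 'date' rule: select an element by class, drop its
 * <span> and <a> children (the "-span -a" filter), read the remaining
 * text, and keep only the part before the first space (the callback).
 */
function extractDate(string $html, string $class): string
{
    $doc = new DOMDocument();
    // Suppress libxml notices about HTML fragments
    $doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
    $xpath = new DOMXPath($doc);

    // XPath equivalent of the CSS class selector ".$class"
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    $node = $xpath->query($query)->item(0);
    if ($node === null) {
        return '';
    }

    // Copy the node list to an array first, because removing nodes
    // while iterating a live DOMNodeList is unreliable
    foreach (iterator_to_array($xpath->query('.//span | .//a', $node)) as $child) {
        $child->parentNode->removeChild($child);
    }

    // trim() is an addition here; the original callback explodes the raw text
    $text = trim($node->textContent);
    return explode(' ', $text)[0];
}

/**
 * Mirrors how the 'content' callback names a downloaded image:
 * the md5 hash of the absolute URL, stored under the image folder.
 */
function localImageName(string $base, string $src): string
{
    return 'image/' . md5($base . $src) . '.jpg';
}

echo extractDate($html, 'pt_info'), PHP_EOL; // prints "2016-05-20"
echo localImageName('http://cms.querylist.cc', '/uploads/a.png'), PHP_EOL;
```

Hashing the full URL gives every remote image a stable local file name, so collecting the same article twice overwrites the earlier copy instead of creating duplicates.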
Note:
When you use the phpQuery class, prefix it with a backslash (`\phpQuery`), because phpQuery.php does not declare a namespace. The namespace is left out on purpose: if phpQuery.php declared one, QueryList.php could no longer use the phpQuery class.
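The namespace rule in the note can be reproduced without phpQuery. The sketch below is a self-contained illustration under assumed names: `LegacyParser` stands in for the global phpQuery class, and `app\index\controller` stands in for a ThinkPHP5 controller namespace. Neither name comes from the original project.

```php
<?php
// Stand-in for phpQuery.php: a class declared in the global namespace.
namespace {
    class LegacyParser
    {
        public static function label(): string
        {
            return 'global ' . __CLASS__;
        }
    }
}

// Stand-in for a ThinkPHP5 controller file, which does declare a namespace.
namespace app\index\controller {
    function resolveGlobal(): string
    {
        // The leading backslash looks LegacyParser up in the global namespace.
        return \LegacyParser::label();
    }

    function resolveRelative(): bool
    {
        // Without the backslash the name resolves to
        // app\index\controller\LegacyParser, which does not exist.
        return class_exists(LegacyParser::class, false);
    }
}

namespace {
    echo \app\index\controller\resolveGlobal(), PHP_EOL; // prints "global LegacyParser"
    var_dump(\app\index\controller\resolveRelative());   // bool(false)
}
```

Unqualified function calls behave differently from class names: PHP falls back to the global function when no namespaced one exists, which is why the collection example can call `pq()` without a backslash while `\phpQuery` needs one.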