Thinkphp5 and QueryList implement the page collection function (crawler)

藏色散人

Jan 28, 2020 pm 01:57 PM


What is QueryList?

QueryList is a PHP toolkit for content collection that follows modern development ideas, with a simple, elegant syntax and strong extensibility. Instead of the obscure regular expressions traditionally used for collection, QueryList uses CSS selectors, which are more powerful and easier to read. This greatly lowers the barrier to collecting content with PHP and makes the collection code easy to read and maintain, so you can say goodbye to obscure, hard-to-maintain regular expressions.

QueryList provides a complete set of content collection solutions:

● DOM content selection: CSS selector

● HTTP client: GuzzleHTTP

● Content filtering: CSS selector

● Fixing garbled characters: several built-in solutions for encoding problems

● Additional features: Rich extension plug-ins

Prerequisites

The project uses the ThinkPHP5 framework and relies mainly on two files, `QueryList.php` and `phpQuery.php`. Switch to the project directory, create a new QL directory under extend, and then run the following composer command inside the QL directory to install QueryList:

composer require jaeger/querylist

Then add `use QL\QueryList;` to the controller that needs it, and write the collection code in that controller. The following is an example:

// Target page to collect
$page = 'http://cms.querylist.cc/news/566.html';
// Collection rules
$reg = array(
   // Collect the article title
   'title' => array('h1','text'),
   // Collect the publication date; QueryList's filter feature removes the span and a tags
   'date' => array('.pt_info','text','-span -a',function($content){
       // Use the callback to narrow the remaining text down to the date
       $arr = explode(' ',$content);
       return $arr[0];
   }),
   // Collect the article body; the filter strips hyperlinks but keeps their text,
   // and removes the copyright notice, JS code and other useless content
   'content' => array('.post_content','html','a -.content_copyright -script',function($content){
       // Use the callback to download the images in the article and replace their paths with local paths
       // To run this example, make sure an image folder exists in the current directory and is writable
       // QueryList is built on phpQuery, so phpQuery can be used anywhere;
       // regular expressions or other approaches would work here as well

       $doc=\phpQuery::newDocumentHTML($content);
       $imgs = pq($doc)->find('img');
       foreach ($imgs as $img) {
           $src = 'http://cms.querylist.cc'.pq($img)->attr('src');
           // Save into the image folder mentioned above
           $localSrc = 'image/'.md5($src).'.jpg';
           $stream = file_get_contents($src);
           file_put_contents($localSrc,$stream);
           pq($img)->attr('src',$localSrc);
       }
       return $doc->htmlOuter();
   })
);
// Page region that the rules are applied to
$rang = '.content';
$ql = QueryList::Query($page,$reg,$rang);
$data = $ql->getData();
// Print the result
print_r($data);
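
The callback attached to the date rule can be tried on its own, without QueryList. The snippet below is a minimal sketch: the `$sample` string is a made-up stand-in for whatever text remains in `.pt_info` after the span and a tags are filtered out, and the `trim()` call is an extra precaution that the original callback does not have.

```php
<?php
// Hypothetical text left in .pt_info after filtering (not taken from the real page).
$sample = " 2017-03-12 10:32:15 views 1024";

// Same idea as the date callback in the rules: split on spaces, keep the first part.
$pickDate = function ($content) {
    // trim() is an added assumption: it prevents an empty first element
    // when the collected text starts with whitespace.
    $arr = explode(' ', trim($content));
    return $arr[0];
};

$date = $pickDate($sample);
echo $date, PHP_EOL; // prints 2017-03-12
```

Testing a callback this way before plugging it into the rules makes it easier to tell a selector problem apart from a string-handling problem.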

Note:

A backslash (\) must be placed in front of the phpQuery class name when it is used, because phpQuery.php does not declare a namespace. Adding a namespace to phpQuery.php is not an option, because QueryList.php would then no longer be able to use the phpQuery class.
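
The name-resolution rule behind this note can be checked without phpQuery. In the sketch below, the built-in global class `ArrayObject` stands in for phpQuery, and `app\index\controller` is an assumed controller namespace; the `resolves()` helper is hypothetical and exists only for this demonstration.

```php
<?php
namespace app\index\controller; // assumed namespace of a ThinkPHP5 controller

// Hypothetical helper: reports whether the factory can instantiate its class.
function resolves(callable $factory): bool
{
    try {
        $factory();
        return true;
    } catch (\Error $e) {
        // PHP 7+ throws an Error when a class cannot be found.
        return false;
    }
}

// Without the backslash, PHP looks for app\index\controller\ArrayObject.
$unqualified = resolves(function () { return new ArrayObject(); });

// With the backslash, PHP looks in the global namespace, where the class lives.
$qualified = resolves(function () { return new \ArrayObject(); });

var_dump($unqualified); // bool(false)
var_dump($qualified);   // bool(true)
```

The same applies to `\phpQuery::newDocumentHTML()` inside a namespaced controller: class names are resolved relative to the current namespace unless they start with a backslash.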


Statement

This article is reproduced from cnblogs. If there is any infringement, please contact admin@php.cn to have it removed.
