When crawlers come up, most people think of Python first, and it does have real advantages there. But PHP can also crawl data asynchronously. This article introduces how to do it with PHP.
What is a web crawler?
A web crawler is a program that automatically fetches web pages. It downloads pages from the World Wide Web for search engines and is an important component of them. A traditional crawler starts from the URLs of one or more seed pages; while crawling, it keeps extracting new URLs from the current page and adding them to a queue, until some stopping condition is met.
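The loop just described can be sketched in a few lines of plain PHP before we turn to QueryList. This is only a simplified illustration; the seed URL, the page limit and the regex-based link extraction are placeholder choices, not part of QueryList:

<?php
// Minimal sketch of the crawl loop described above (not specific to QueryList).
// file_get_contents() and a simple regex are used only to keep the example short.

$queue    = ['https://example.com'];   // seed URL(s) -- placeholder
$visited  = [];
$maxPages = 20;                        // a simple stopping condition

while ($queue && count($visited) < $maxPages) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;                      // already crawled
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        continue;                      // fetch failed, move on
    }

    // Extract absolute links from the page and push new ones onto the queue
    preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $matches);
    foreach ($matches[1] as $link) {
        if (!isset($visited[$link])) {
            $queue[] = $link;
        }
    }
}

print_r(array_keys($visited));         // the pages that were crawled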
What is the use of crawlers?
As the web page collector for general-purpose search engines (Google, Baidu).
Build a vertical search engine.
Scientific research: empirical studies of online human behavior, online community evolution, human dynamics, econometric sociology, complex networks, data mining and similar fields all require large amounts of data, and web crawlers are a powerful tool for collecting it.
Peeping, hacking, spamming...
QueryList Introduction and Features
QueryList is a simple, elegant, extensible PHP scraping tool (crawler) based on phpQuery.
Features:
The same CSS3 DOM selectors as jQuery
The same DOM manipulation API as jQuery
A universal list-collection solution
A powerful HTTP request suite that makes it easy to implement simulated login, browser spoofing, HTTP proxies and other complex network requests
A solution for garbled (mis-encoded) content
Powerful content filtering: jQuery selectors can be used to filter content
A highly modular design with strong extensibility
An expressive API
High-quality documentation
A rich set of plug-ins
A professional Q&A community and discussion group
Plug-ins make it easy to implement features such as:
Multi-thread collection
Image localization (saving images locally)
Simulating browser behavior, such as submitting forms
Full web crawling
Environment requirements
PHP >= 7.0
If you are still on PHP 5, or you do not know how to use Composer, you can use QueryList3 instead. QueryList3 supports PHP 5.3 and manual installation. QueryList3 documentation: http://v3.querylist.cc
Installation
Installation via Composer:
composer require jaeger/querylist
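All of the examples below assume that Composer's autoloader has been included and the QueryList class imported, roughly like this (the path to vendor/autoload.php depends on your project layout):

<?php
// Load Composer's autoloader and import the QueryList class (namespace QL).
require __DIR__ . '/vendor/autoload.php';

use QL\QueryList;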
Usage
Element operations
Collect all image URLs from Nipic.com:
QueryList::get('http://www.nipic.com')->find('img')->attrs('src');
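To actually see the collected addresses, the returned collection can be converted to a plain array and printed, for example like this (a small usage sketch building on the one-liner above):

// Collect all image src attributes from the page and print them as an array.
$srcs = QueryList::get('http://www.nipic.com')->find('img')->attrs('src');
print_r($srcs->all());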
Collect Baidu search results
$ql = QueryList::get('http://www.baidu.com/s?wd=QueryList');

$ql->find('title')->text();                 // Get the page title
$ql->find('meta[name=keywords]')->content;  // Get the keywords from the page head
$ql->find('h3>a')->texts();                 // Get the list of search result titles
$ql->find('h3>a')->attrs('href');           // Get the list of search result links
$ql->find('img')->src;                      // Get the URL of the first image
$ql->find('img:eq(1)')->src;                // Get the URL of the second image
$ql->find('img')->eq(2)->src;               // Get the URL of the third image

// Iterate over all images
$ql->find('img')->map(function($img){
    echo $img->alt;                         // Print the image's alt attribute
});
More usage
$ql->find('#head')->append('<div>Appended content</div>')->find('div')->htmls();
$ql->find('.two')->children('img')->attrs('alt');  // Get all img child nodes under the element with class "two"

// Iterate over all child nodes under the element with class "two"
$data = $ql->find('.two')->children()->map(function ($item){
    // Use is() to check the node type
    if($item->is('a')){
        return $item->text();
    }elseif($item->is('img')){
        return $item->alt;
    }
});

$ql->find('a')->attr('href', 'newVal')->removeClass('className')->html('newHtml')->...
$ql->find('div > p')->add('div > ul')->filter(':has(a)')->find('p:first')->nextAll()->andSelf()->...
$ql->find('div.old')->replaceWith( $ql->find('div.new')->clone())->appendTo('.trash')->prepend('Deleted')->...
List collection
Collect the titles and links from the Baidu search results list:
$data = QueryList::get('http://www.baidu.com/s?wd=QueryList')
    // Set the collection rules
    ->rules([
        'title' => array('h3', 'text'),
        'link'  => array('h3>a', 'href')
    ])
    ->query()->getData();

print_r($data->all());
Collection results:
Array
(
    [0] => Array
        (
            [title] => QueryList|基于phpQuery的无比强大的PHP采集工具
            [link] => http://www.baidu.com/link?url=GU_YbDT2IHk4ns1tjG2I8_vjmH0SCJEAPuuZN
        )
    [1] => Array
        (
            [title] => PHP 用QueryList抓取网页内容 - wb145230 - 博客园
            [link] => http://www.baidu.com/link?url=zn0DXBnrvIF2ibRVW34KcRVFG1_bCdZvqvwIhUqiXaS
        )
    [2] => Array
        (
            [title] => 介绍- QueryList指导文档
            [link] => http://www.baidu.com/link?url=pSypvMovqS4v2sWeQo5fDBJ4EoYhXYi0Lxx
        )
    //...
)
Encoding conversion
// Output encoding: UTF-8, input encoding: GB2312
QueryList::get('https://top.etao.com')->encoding('UTF-8','GB2312')->find('a')->texts();

// Output encoding: UTF-8, input encoding: auto-detected
QueryList::get('https://top.etao.com')->encoding('UTF-8')->find('a')->texts();
HTTP network operation
Log in to Sina Weibo with cookies
// Crawl Sina Weibo pages that require login to access
$ql = QueryList::get('http://weibo.com','param1=testvalue&params2=somevalue',[
    'headers' => [
        // Fill in the cookie obtained from your browser
        'Cookie' => 'SINAGLOBAL=546064; wb_cmtLike_2112031=1; wvr=6;....'
    ]
]);

//echo $ql->getHtml();
echo $ql->find('title')->text();
// Output: 我的首页 微博-随时随地发现新鲜事
Use an HTTP proxy
$urlParams = ['param1' => 'testvalue','params2' => 'somevalue'];
$opts = [
    // Set the HTTP proxy
    'proxy' => 'http://222.141.11.17:8118',
    // Set the timeout, in seconds
    'timeout' => 30,
    // Forge HTTP headers
    'headers' => [
        'Referer' => 'https://querylist.cc/',
        'User-Agent' => 'testing/1.0',
        'Accept' => 'application/json',
        'X-Foo' => ['Bar', 'Baz'],
        'Cookie' => 'abc=111;xxx=222'
    ]
];
$ql->get('http://httpbin.org/get',$urlParams,$opts);
// echo $ql->getHtml();
Simulated login
// Log in with POST
$ql = QueryList::post('http://xxxx.com/login',[
    'username' => 'admin',
    'password' => '123456'
])->get('http://xxx.com/admin');

// Crawl pages that require login to access
$ql->get('http://xxx.com/admin/page');
//echo $ql->getHtml();
Form operations
Simulate login to GitHub
// Get a QueryList instance
$ql = QueryList::getInstance();

// Get the login form
$form = $ql->get('https://github.com/login')->find('form');

// Fill in your GitHub username and password
$form->find('input[name=login]')->val('your github username or email');
$form->find('input[name=password]')->val('your github password');

// Serialize the form data
$formData = $form->serializeArray();
$postData = [];
foreach ($formData as $item) {
    $postData[$item['name']] = $item['value'];
}

// Submit the login form
$actionUrl = 'https://github.com'.$form->attr('action');
$ql->post($actionUrl,$postData);

// Check whether the login succeeded
// echo $ql->getHtml();
$userName = $ql->find('.header-nav-current-user>.css-truncate-target')->text();
if($userName) {
    echo 'Login succeeded! Welcome, '.$userName;
}else{
    echo 'Login failed!';
}
Bind function extension
Customize a myHttp method:
$ql = QueryList::getInstance();

// Bind a myHttp method to the QueryList object
$ql->bind('myHttp',function ($url){
    $html = file_get_contents($url);
    $this->setHtml($html);
    return $this;
});

// Then it can be called via the registered name
$data = $ql->myHttp('https://toutiao.io')->find('h3 a')->texts();
print_r($data->all());
Or encapsulate the implementation into a class, and then bind it like this:
$ql->bind('myHttp',function ($url){ return new MyHttp($this,$url); });
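The MyHttp class itself is not shown above. A minimal sketch of what it could look like, assuming it simply fetches the URL and hands the HTML back to the QueryList instance (an illustrative assumption, not the article's actual implementation):

// Hypothetical MyHttp class: fetches the URL and passes the HTML to QueryList.
// The original article does not define this class; this is an assumed implementation.
class MyHttp
{
    public function __construct(QueryList $ql, $url)
    {
        $html = file_get_contents($url);
        $ql->setHtml($html);
    }
}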
Plug-in usage
Use the CurlMulti multi-thread plug-in to collect the GitHub trending lists concurrently:
$ql = QueryList::use(CurlMulti::class);

$ql->curlMulti([
    'https://github.com/trending/php',
    'https://github.com/trending/go',
    //.....more urls
])
// This callback is called each time a task completes successfully
->success(function (QueryList $ql,CurlMulti $curl,$r){
    echo "Current url:{$r['info']['url']} \r\n";
    $data = $ql->find('h3 a')->texts();
    print_r($data->all());
})
// Callback for each failed task
->error(function ($errorInfo,CurlMulti $curl){
    echo "Current url:{$errorInfo['info']['url']} \r\n";
    print_r($errorInfo['error']);
})
->start([
    // Maximum number of concurrent requests
    'maxThread' => 10,
    // Number of retries on error
    'maxTry' => 3,
]);
For more details, please check GitHub: https://github.com/jae-jae/QueryList