
Implementing the simplest crawler prototype in PHP

巴扎黑
2016-11-24 13:41:00

The simplest crawler model works like this: given an initial URL, the crawler downloads the page, extracts the URLs it contains, and then crawls each of those URLs in turn as a new starting point.
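As a rough sketch of this model (not the author's code), the loop can be written as a queue of URLs to visit plus a set of URLs already seen, so that pages linking back to each other do not cause endless re-crawling. The names `crawlSketch`, `$fetchPage`, `$extractLinks`, and the `example.test` URLs are assumptions made for illustration. Fetching and link extraction are passed in as callables so the sketch runs without network access.

```php
<?php
// Sketch of the crawl loop: a queue of URLs to visit plus a set of URLs already seen.
// crawlSketch, $fetchPage and $extractLinks are hypothetical names used for illustration.
function crawlSketch(string $startUrl, callable $fetchPage, callable $extractLinks, int $maxPages = 100): array
{
    $queue = [$startUrl];         // URLs waiting to be crawled
    $seen = [$startUrl => true];  // URLs already queued, to avoid loops
    $visited = [];                // URLs crawled so far, in order

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;
        $content = $fetchPage($url);
        if ($content === false) {
            continue; // page could not be fetched
        }
        foreach ($extractLinks($content) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// Usage with an in-memory "web": each page body is just a space-separated list of links.
$pages = [
    'http://example.test/a' => 'http://example.test/b http://example.test/c',
    'http://example.test/b' => 'http://example.test/a',
    'http://example.test/c' => '',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};
$extract = function (string $content): array {
    return preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY);
};
$order = crawlSketch('http://example.test/a', $fetch, $extract);
print_r($order);
```

With this toy data the pages are visited in the order a, b, c. The link from b back to a is skipped because a is already in the seen set.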

The following is the simplest crawler model implemented in PHP.

<?php
/**
 * Crawler program -- prototype
 *
 * BookMoth 2009-02-21
 */
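/**
 * Fetch the HTML content of the given URL.
 *
 * @param string $url
 * @return string|false HTML content (at most 1 MiB), or false on failure
 */
function _getUrlContent($url){
    // Needs allow_url_fopen; @ suppresses the warning for unreachable URLs
    $handle = @fopen($url, "r");
    if($handle){
        $content = stream_get_contents($handle, 1024*1024);
        fclose($handle);
        return $content;
    }else{
        return false;
    }
}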
/**
 * Extract links from HTML content.
 *
 * @param string $web_content
 * @return array
 */
function _filterUrl($web_content){
    // Capture the href value of every <a> tag (quoted or unquoted, case-insensitive)
    $reg_tag_a = '/<a\b[^>]*?href\s*=\s*[\'"]?([^>\'"\s]*)[^>]*>/i';
    $result = preg_match_all($reg_tag_a, $web_content, $match_result);
    if($result){
        return $match_result[1];
    }
    return array(); // no links found
}
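/**
 * Convert relative paths to absolute URLs.
 *
 * @param string $base_url
 * @param array $url_list
 * @return array
 */
function _reviseUrl($base_url,$url_list){
    if(!is_array($url_list)){
        return;
    }
    $url_info = parse_url($base_url);
    // Rebuild scheme://user:pass@host:port; isset() avoids notices for missing parts
    $root_url = $url_info["scheme"].'://';
    if(isset($url_info["user"]) && isset($url_info["pass"])){
        $root_url .= $url_info["user"].":".$url_info["pass"]."@";
    }
    $root_url .= $url_info["host"];
    if(isset($url_info["port"])){
        $root_url .= ":".$url_info["port"];
    }
    // Directory part of the path, without file name or trailing slash
    $path = isset($url_info["path"]) ? $url_info["path"] : "/";
    $base_url = $root_url.substr($path, 0, (int) strrpos($path, "/"));
    $result = array();
    foreach ($url_list as $url_item) {
        if(preg_match('/^https?:/i',$url_item)){
            // Already a complete URL
            $result[] = $url_item;
        }else if(substr($url_item, 0, 1) === '/'){
            // Root-relative URL
            $result[] = $root_url.$url_item;
        }else {
            // Relative URL
            $result[] = $base_url.'/'.$url_item;
        }
    }
    return $result;
}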
/**
 * Crawler: fetch one page and return the links found on it.
 *
 * @param string $url
 * @return array|null
 */
function crawler($url){
    $content = _getUrlContent($url);
    if($content){
        $url_list = _reviseUrl($url, _filterUrl($content));
        if($url_list){
            return $url_list;
        }
    }
    return null;
}
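/**
 * Main program for testing
 */
function main(){
    $current_url = "http://hao123.com/"; // initial URL
    $fp_puts = fopen("url.txt", "ab"); // handle for appending discovered URLs
    $fp_gets = fopen("url.txt", "r");  // handle for reading back the next URL to crawl
    do{
        // fgets() keeps the line ending, so trim it before fetching
        $result_url_arr = crawler(trim($current_url));
        if($result_url_arr){
            foreach ($result_url_arr as $url) {
                fputs($fp_puts, $url."\r\n");
            }
        }
    }while ($current_url = fgets($fp_gets, 1024)); // keep reading URLs until none are left
}
main();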
?>
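
The prototype's `_reviseUrl` joins relative links by string concatenation. It does not collapse `.` or `..` segments or handle protocol-relative links such as `//host/path`. The following is a minimal sketch of a somewhat fuller resolution step, assuming the base URL always has a scheme and host. `resolveUrlSketch` is a hypothetical helper, not part of the original listing, and it deliberately ignores query-only and fragment-only references.

```php
<?php
// Hypothetical helper (not part of the article's code): resolve $link against $base.
// Simplified on purpose: query-only ("?x=1") and fragment-only ("#top") links are not handled.
function resolveUrlSketch(string $base, string $link): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link; // already absolute (has a scheme)
    }
    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (strpos($link, '//') === 0) {
        return $parts['scheme'] . ':' . $link; // protocol-relative link
    }
    if (strpos($link, '/') === 0) {
        $path = $link; // root-relative link
    } else {
        // Relative to the directory of the base path
        $basePath = $parts['path'] ?? '/';
        $dir = substr($basePath, 0, (int) strrpos($basePath, '/') + 1);
        $path = $dir . $link;
    }
    // Collapse "." and ".." segments
    $out = [];
    foreach (explode('/', ltrim($path, '/')) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    return $origin . '/' . implode('/', $out);
}

// Usage: resolve a few links found on a page in a subdirectory.
$base = 'http://example.test/dir/page.html';
foreach (['other.html', '/top.html', '../up.html', '//cdn.test/lib.js'] as $link) {
    echo resolveUrlSketch($base, $link), "\n";
}
```

A production crawler would more likely rely on a tested URL library for this step. The sketch only shows which cases plain concatenation misses.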

