PHP self-made crawler v1.0 based on simple_html_dom

WBOY (original)
2016-08-08 09:30:46 | 988 views

My enthusiasm for web page parsing and writing crawlers has not faded at all. Today I built a crawler with the open-source simple_html_dom.php parsing library:

<?php
/*
	*.Pho spider v1.0
	*.Written by Radish.ghost 2015.1.20
*/
//error_reporting(0); //Turn off error reporting
//cURL mode: to be implemented in a later version
include_once("simple_html_dom.php");
$html=file_get_html('http://www.baidu.com');//The URL to crawl
if(!$html)die('Could not load the start page');//file_get_html() may return false instead of a DOM object

$tmp=array();//URLs collected in the first pass
foreach($html->find('a') as $e)
{
	$f=trim($e->href);
	if($f==='')continue;//Skip anchors without an href
	//if($f[10]==':')continue;
	if(strpos($f,'//')===0)$f='http:'.$f;//Complete a protocol-relative URL
	elseif($f[0]=='/')$f='http://www.baidu.com'.$f;//Complete a root-relative URL
	if(stripos($f,'https://')===0)continue;//Skip https:// URLs (simple_html_dom may fail to fetch them)
	if(stripos($f,"baidu")===false)continue;//Skip URLs outside this site
	echo $f . '<br>';
	$tmp[]=$f;//Save the URL for the second pass
}

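foreach($tmp as $r)//Crawl each URL saved in $tmp
{
	$html2=file_get_html($r);//Repeat the step for this page
	if(!$html2)continue;//Skip pages that could not be loaded (file_get_html() may return false)
	foreach($html2->find('a') as $a)
	{
		$u=trim($a->href);
		if($u==='')continue;
		if(strpos($u,'//')===0)$u='http:'.$u;
		elseif($u[0]=='/')$u='http://www.baidu.com'.$u;
		if(stripos($u,'https://')===0)continue;
		if(stripos($u,"baidu")===false)continue;
		echo $u.'<br>';
	}
	$html2->clear();//Free the DOM before loading the next page
	$html2=null;
}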
?>
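
The two loops above apply the same link handling: complete a relative URL, skip https, and keep only links that mention "baidu". As a rough sketch, that logic could live in one function that checks the parsed host rather than searching the whole URL for a substring. The name normalize_link and its $base and $site parameters are hypothetical and not part of simple_html_dom; skipping https simply mirrors the choice made in the listing.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): return an absolute
// http:// URL on the target site, or null when the link should be skipped.
function normalize_link($href, $base = 'http://www.baidu.com', $site = 'baidu.com')
{
    $href = trim((string) $href);
    if ($href === '') {
        return null; // <a> without a usable href
    }
    if (strpos($href, '//') === 0) {
        $href = 'http:' . $href;            // protocol-relative: //host/path
    } elseif ($href[0] === '/') {
        $href = rtrim($base, '/') . $href;  // root-relative: /path
    }
    $parts = parse_url($href);
    if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])) {
        return null; // "#top", "javascript:;", bare relative paths, malformed URLs
    }
    // Same choice as the loops above: only plain http:// links are followed.
    if (strtolower($parts['scheme']) !== 'http') {
        return null;
    }
    // Compare the host itself so "http://example.com/?q=baidu" is rejected.
    $host = strtolower($parts['host']);
    $suffix = '.' . $site;
    if ($host !== $site && substr($host, -strlen($suffix)) !== $suffix) {
        return null;
    }
    return $href;
}
```

Inside either loop, the three if-lines could then become a single call such as $f=normalize_link($e->href); followed by if($f===null)continue;.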

//In my runs the script always ended with "Fatal error: Call to a member function find() on a non-object in D:\xampp\htdocs\html\index.php on line 21". After talking with more experienced developers I corrected a lot of small mistakes, but I have not been able to confirm the cause of this one. I hope someone can give me some advice.
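
One likely cause of the "non-object" error, assuming the library build in use makes file_get_html() return false when a page cannot be fetched or is larger than its size limit, is that find() ends up being called on false. The following is only a sketch of a guard under that assumption. The name collect_hrefs and its $fetch parameter are hypothetical; the loader is passed in as a callable so the check can be tried without network access.

```php
<?php
// Hypothetical helper: load a page through $fetch and return the href of
// every <a> element, or an empty array when no DOM object came back.
function collect_hrefs($url, $fetch)
{
    $dom = call_user_func($fetch, $url);
    if (!is_object($dom)) {
        // The loader returned false (or something else without find()),
        // so there is nothing to search on this page.
        return array();
    }
    $hrefs = array();
    foreach ($dom->find('a') as $a) {
        $href = trim((string) $a->href);
        if ($href !== '') {
            $hrefs[] = $href;
        }
    }
    $dom->clear(); // release the parsed DOM before the next page is loaded
    return $hrefs;
}

// Usage with the real library (assumes simple_html_dom.php can be included):
//   include_once 'simple_html_dom.php';
//   $links = collect_hrefs('http://www.baidu.com', 'file_get_html');
```

Returning an empty array keeps the crawl going when a single page fails, which is the same effect as the continue in a loop-based check.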

----------------------------------------------------------------

simple_html_dom Download:

https://github.com/Ph0enixxx/simple_html_dom

(= = I can't use git4win on my home computer.)
