
A self-made PHP crawler v1.0 based on simple_html_dom

WBOY | Original | 2016-08-08 09:30:46 | 1023 views

My enthusiasm for web page parsing and crawler writing has not faded at all. Today I built a crawler with the open-source simple_html_dom.php parsing library.

<?php
/*
 * Pho spider v1.0
 * Written by Radish.ghost 2015.1.20
 */
//error_reporting(1); // uncomment to report fatal errors only
// cURL mode: planned for a later version
include_once("simple_html_dom.php");

$base = 'http://www.baidu.com'; // the URL you want to crawl
$html = file_get_html($base);
if (!$html) {
	// file_get_html() returns false when the page cannot be loaded
	die('Could not load ' . $base);
}

$tmp = array(); // URLs collected in the first pass
foreach ($html->find('a') as $e)
{
	$f = $e->href;
	if (!is_string($f) || $f === '') continue; // <a> without an href
	if ($f[0] == '/') $f = $base . $f; // complete a root-relative URL
	if (stripos($f, 'https://') === 0) continue; // skip https:// URLs (simple_html_dom may fail to fetch them)
	if (stripos($f, 'baidu') === false) continue; // skip URLs outside this site
	echo $f . '<br>';
	$tmp[] = $f; // save the URL
}

foreach ($tmp as $r) // second pass: crawl every URL saved in $tmp
{
	$html2 = file_get_html($r); // repeat the same steps
	if (!$html2) continue; // skip pages that failed to load
	foreach ($html2->find('a') as $a)
	{
		$u = $a->href;
		if (!is_string($u) || $u === '') continue;
		if ($u[0] == '/') $u = $base . $u;
		if (stripos($u, 'https://') === 0) continue;
		if (stripos($u, 'baidu') === false) continue;
		echo $u . '<br>';
	}
	$html2->clear(); // free the parsed DOM
	$html2 = null;
}
?>
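
The link handling in both loops (complete the URL, skip https, skip other sites) can be pulled into one function. The following is only a sketch: `normalize_link`, `$base` and `$siteKeyword` are hypothetical names introduced here and are not part of simple_html_dom. It assumes PHP 7.1 or later, also handles protocol-relative links (`//host/path`), and simply drops plain relative paths such as `page.html`.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: turn an href into an
// absolute http:// URL, or return null when the link should be skipped.
function normalize_link($href, string $base, string $siteKeyword): ?string
{
    if (!is_string($href) || $href === '') {
        return null;                       // <a> without a usable href
    }
    if (strpos($href, '//') === 0) {
        $href = 'http:' . $href;           // protocol-relative link
    } elseif ($href[0] === '/') {
        $href = rtrim($base, '/') . $href; // root-relative link
    }
    if (stripos($href, 'http://') !== 0) {
        return null;                       // https://, javascript:, mailto:, "#", ...
    }
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host) || stripos($host, $siteKeyword) === false) {
        return null;                       // link points to another site
    }
    return $href;
}
```

Inside a loop it would be used as `$f = normalize_link($e->href, 'http://www.baidu.com', 'baidu'); if ($f === null) continue;`. Checking the keyword against the host rather than the whole URL avoids accepting a foreign page that merely has "baidu" in its query string.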

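While writing this I kept running into a fatal error: Call to a member function find() on a non-object in D:\xampp\htdocs\html\index.php on line 21. After talking with more experienced developers I fixed a lot of small mistakes, but I had not tracked this one down. The message appears when file_get_html() fails to load a page and returns false, so its result has to be checked before find() is called.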
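
One way to avoid that fatal error is to never call find() on a value that is not an object. This is a minimal sketch, assuming the loader behaves like file_get_html() and returns false on failure. `collect_hrefs` and `$loader` are hypothetical names introduced for illustration; the loader is passed in so the guard can be tried without network access.

```php
<?php
// Hypothetical helper: $loader is any callable that takes a URL and returns
// a parsed document object or false, e.g. 'file_get_html' from simple_html_dom.
function collect_hrefs(callable $loader, string $url): array
{
    $dom = $loader($url);
    if (!is_object($dom)) {
        return [];              // load failed: nothing to call find() on
    }
    $hrefs = [];
    foreach ($dom->find('a') as $a) {
        // keep only anchors that actually carry a non-empty href string
        if (is_string($a->href) && $a->href !== '') {
            $hrefs[] = $a->href;
        }
    }
    return $hrefs;
}
```

With simple_html_dom.php included, `collect_hrefs('file_get_html', 'http://www.baidu.com')` would return the raw href values, or an empty array when the page cannot be loaded.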

---------------------- divider ----------------------

simple_html_dom download:

https://github.com/Ph0enixxx/simple_html_dom

= = I can't use git4win on my computer at home
