
Introduction and use of snoopy

WBOY (Original)
2016-08-08 09:24:27

Snoopy is a PHP class that simulates a web browser: it can fetch web page content and submit forms. It requires PHP 4 or later with PCRE (Perl Compatible Regular Expressions) support, which any basic LAMP setup provides. Because Snoopy is a plain PHP class, it needs no extensions, which makes it a good choice when the server does not support cURL. The official download address of the Snoopy class is: http://snoopy.sourceforge.net/

1. Some features of Snoopy:

  1. Fetches the content of a web page (fetch)
  2. Fetches the text of a web page with the HTML tags removed (fetchtext)
  3. Fetches the links and forms of a web page (fetchlinks, fetchform)
  4. Supports proxy hosts
  5. Supports basic username/password authentication
  6. Supports setting the user_agent, referer, cookies and header content
  7. Supports browser redirects, with a controllable redirect depth
  8. Expands the links in a web page into fully qualified URLs (the default)
  9. Submits data and gets the return value
  10. Supports following HTML frames
  11. Supports passing cookies on redirects

2. Class methods:

fetch($URI)
  Fetches the content of a web page. $URI is the URL of the page to fetch. The result is stored in $this->results. If the page is a frameset, Snoopy follows each frame and stores the frames as an array in $this->results.

fetchtext($URI)
  Like fetch(), except that it strips the HTML tags and other irrelevant data and returns only the text content of the page.

fetchform($URI)
  Like fetch(), except that it strips the HTML tags and other irrelevant data and returns only the form elements of the page.

fetchlinks($URI)
  Like fetch(), except that it strips the HTML tags and other irrelevant data and returns only the links in the page. By default, relative links are expanded into complete URLs.

submit($URI, $formvars)
  Submits a form to the address given by $URI. $formvars is an array holding the form parameters.

submittext($URI, $formvars)
  Like submit(), except that it strips the HTML tags and other irrelevant data and returns only the text content of the page that comes back after the submission (for example, after a login).

submitlinks($URI)
  Like submit(), except that it strips the HTML tags and other irrelevant data and returns only the links in the page. By default, relative links are expanded into complete URLs.

3. Class attributes (default values in parentheses):

  $host           host to connect to
  $port           port to connect to
  $proxy_host     proxy host to use, if any
  $proxy_port     proxy port to use, if any
  $agent          user agent to masquerade as (Snoopy v0.1)
  $referer        referer information, if any
  $cookies        cookies, if any
  $rawheaders     other header information, if any
  $maxredirs      maximum number of redirects, 0 = not allowed (5)
  $offsiteok      whether to allow off-site redirects (true)
  $expandlinks    whether to expand all links into complete URLs (true)
  $user           authentication username, if any
  $pass           authentication password, if any
  $accept         HTTP Accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
  $error          where errors are reported, if any
  $response_code  response code returned from the server
  $headers        headers returned from the server
  $maxlength      maximum length of the returned data
  $read_timeout   timeout for read operations (requires PHP 4 Beta 4+); set to 0 for no timeout
  $timed_out      true if a read operation timed out (requires PHP 4 Beta 4+)
  $maxframes      maximum number of frames to follow
  $status         HTTP status of the fetch
  $temp_dir       temporary directory that the web server can write to (/tmp)
  $curl_path      path to the cURL binary; set to false if there is no cURL binary

4. The following is a demo:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->proxy_host = "www.phpoac.com";
    $snoopy->proxy_port = "8080";

    $snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
    $snoopy->referer = "http://www.phpoac.com/";

    // Cookie values are sent as text, so the session ID is quoted.
    $snoopy->cookies["SessionID"] = "238472834723489";
    $snoopy->cookies["favoriteColor"] = "RED";

    $snoopy->rawheaders["Pragma"] = "no-cache";

    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = false;

    $snoopy->user = "joe";
    $snoopy->pass = "bloe";

    if ($snoopy->fetchtext("http://www.phpoac.com")) {
        echo "<pre>" . htmlspecialchars($snoopy->results) . "</pre>\n";
    }
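
The $expandlinks attribute and fetchlinks() both rely on turning relative links into complete URLs. The following is a minimal sketch of that idea only, not Snoopy's own code: expand_link() is a hypothetical helper written for this illustration, it uses PHP 7 type declarations, and it handles just the two simplest cases (root-relative and path-relative links), ignoring "../" segments and scheme-only links such as mailto:.

```php
<?php
// Hypothetical helper (not part of Snoopy): resolves a link found in a page
// against the address of that page.
function expand_link(string $base, string $href): string
{
    // A link that already carries a scheme is complete: return it unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Root-relative link: attach it directly to scheme and host.
    if (substr($href, 0, 1) === '/') {
        return $root . $href;
    }

    // Path-relative link: resolve it against the directory of the base path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $root . $dir . $href;
}

echo expand_link('http://www.phpoac.com/bbs/index.php', 'forumdisplay.php?fid=156'), "\n";
echo expand_link('http://www.phpoac.com/bbs/index.php', '/bbs/logging.php'), "\n";
```

With $expandlinks set to false, as in the demo above, the links would presumably come back as they appear in the page, so a helper of this kind would be needed before fetching them.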

A second demo logs in to a forum and collects its posts:

    <?php
    // Collect posts from the PHP Open Source Network forum
    set_time_limit(0);
    require_once("Snoopy.class.php");
    $snoopy = new Snoopy();

    // Log in to the forum
    $submit_url = "http://www.phpoac.com/bbs/logging.php?action=login";
    $submit_vars["loginmode"] = "normal";
    $submit_vars["styleid"] = "1";
    $submit_vars["cookietime"] = "315360000";
    $submit_vars["loginfield"] = "username";
    $submit_vars["username"] = "***";   // your username
    $submit_vars["password"] = "*****"; // your password
    $submit_vars["questionid"] = "0";
    $submit_vars["answer"] = "";
    $submit_vars["loginsubmit"] = "Submit";
    $snoopy->submit($submit_url, $submit_vars);

    if ($snoopy->results) {
        // Get the link addresses
        $snoopy->fetchlinks("http://www.phpoac.com/bbs");
        $url = array();
        $url = $snoopy->results;
        // print_r($url);
        foreach ($url as $key => $value) {
            // Keep only forum section addresses such as
            // http://www.phpoac.com/bbs/forumdisplay.php?fid=156&sid=VfcqTR
            // ("#" is the delimiter, so the slashes in the URL need no escaping)
            if (!preg_match("#^(http://www\.phpoac\.com/bbs/forumdisplay\.php\?fid=)[0-9]*&sid=[a-zA-Z]{6}#i", $value)) {
                unset($url[$key]);
            }
        }
        // print_r($url);

        // Loop over the section array $url; here only the first page of the
        // first section is fetched
        $i = 0;
        foreach ($url as $key => $value) {
            if ($i >= 1) {
                // Test limit
                break;
            } else {
                // Visit this section and extract the post links. A real crawl
                // would first extract the paging data of the section, then
                // the posts based on that paging data.
                $snoopy = new Snoopy();
                $snoopy->fetchlinks($value);
                $tie = array();
                $tie[$i] = $snoopy->results;
                // print_r($tie);

                // Filter the array
                foreach ($tie[$i] as $key => $value) {
                    // Keep only post addresses such as
                    // http://www.phpoac.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=1&sid=iBLZfK
                    if (!preg_match("#^(http://www\.phpoac\.com/bbs/viewthread\.php\?tid=)[0-9]*&extra=page%3D1&page=[0-9]*&sid=[a-zA-Z]{6}#i", $value)) {
                        unset($tie[$i][$key]);
                    }
                }
                // print_r($tie[$i]);

                // Group the array: put the pages of the same post into one array
                $left = ''; // common left-hand part of the link
                $j = 0;
                $page = array();
                foreach ($tie[$i] as $key => $value) {
                    $left = substr($value, 0, 52);
                    $m = 0;
                    foreach ($tie[$i] as $pkey => $pvalue) {
                        // Regroup the array
                        if (substr($pvalue, 0, 52) == $left) {
                            $page[$j][$m] = $pvalue;
                            $m++;
                        }
                    }
                    $j++;
                }

                // Remove duplicates from the multi-dimensional array
                // ($page = array_unique($page); only works on one-dimensional arrays)
                $paget[0] = $page[0];
                $nums = count($page);
                for ($n = 1; $n < $nums; $n++) {
                    $paget[$n] = array_diff($page[$n], $page[$n - 1]);
                }

                // Remove empty values from the array
                unset($page);
                $page = array(); // redefine the page array
                $page = array_filter($paget);
                // print_r($page);

                $u = 0;
                $title = array();
                $content = array();
                $temp = '';
                $tt = array();
                foreach ($page as $key => $value) {
                    // Outer loop: one post
                    $content[$u] = ''; // initialize before appending with .=
                    if (is_array($value)) {
                        foreach ($value as $k1 => $v1) {
                            // Inner loop: the N pages of one post
                            $snoopy = new Snoopy();
                            $snoopy->fetch($v1);
                            $temp = $snoopy->results;

                            // Read the title
                            if (!preg_match_all("/<h2>(.*)<\/h2>/i", $temp, $tt)) {
                                echo "no title";
                                exit;
                            } else {
                                $title[$u] = $tt[1][1];
                            }
                            unset($tt);

                            // Read the content.
                            // NOTE: the HTML tags that enclosed (.*) in this
                            // pattern are missing from the source.
                            if (!preg_match_all("/(.*)/i", $temp, $tt)) {
                                print_r($tt);
                                echo "no content1";
                                exit;
                            } else {
                                foreach ($tt[1] as $c => $c2) {
                                    $content[$u] .= $c2;
                                }
                            }
                        }
                    } else {
                        // Fetch the page content directly
                        $snoopy = new Snoopy();
                        $snoopy->fetch($value);
                        $temp = $snoopy->results;

                        // Read the title
                        if (!preg_match_all("/<h2>(.*)<\/h2>/i", $temp, $tt)) {
                            echo "no title";
                            exit;
                        } else {
                            $title[$u] = $tt[1][1];
                        }
                        unset($tt);

                        // Read the content.
                        // NOTE: the HTML tags that enclosed (.*) in this
                        // pattern are missing from the source.
                        if (!preg_match_all("/(.*)/i", $temp, $tt)) {
                            echo "no content2";
                            exit;
                        } else {
                            foreach ($tt[1] as $c => $c2) {
                                $content[$u] .= $c2;
                            }
                        }
                    }
                    $u++;
                }
                print_r($content);
            }
            $i++;
        }
    } else {
        echo "login failed";
        exit;
    }
    ?>
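
The forum demo groups the pages of a post by comparing the first 52 characters of each link, which only works while every thread ID has the same number of digits. As a hedged alternative (a different technique from the one in the demo, not a Snoopy feature), the sketch below groups links by their tid query parameter using PHP's parse_url() and parse_str(). The function name group_by_thread() is hypothetical, and the sample links are made up to follow the URL pattern shown in the demo.

```php
<?php
// Hypothetical helper: groups thread page links by their "tid" query
// parameter and drops exact duplicate links along the way.
function group_by_thread(array $links): array
{
    $pages = array();
    foreach ($links as $link) {
        $query = parse_url($link, PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // no query string, so not a thread link
        }
        parse_str($query, $params);
        if (!isset($params['tid'])) {
            continue; // query string without a thread ID
        }
        // Using the link itself as the key removes exact duplicates.
        $pages[$params['tid']][$link] = $link;
    }
    // Re-index each group so callers get plain lists of links.
    return array_map('array_values', $pages);
}

// Made-up sample links in the format matched by the demo's regular expression.
$links = array(
    'http://www.phpoac.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=1&sid=iBLZfK',
    'http://www.phpoac.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=2&sid=iBLZfK',
    'http://www.phpoac.com/bbs/viewthread.php?tid=68127&extra=page%3D1&page=1&sid=iBLZfK',
    'http://www.phpoac.com/bbs/viewthread.php?tid=70001&extra=page%3D1&page=1&sid=iBLZfK',
);
print_r(group_by_thread($links));
```

Because each group is keyed by thread ID and each link is stored only once, this replaces both the substr() grouping and the array_diff() de-duplication pass in a single loop.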

That covers the introduction to Snoopy and its use. I hope it is helpful to anyone interested in PHP.
