search
HomeBackend DevelopmentPHP Tutorial PHP采撷利器:Snoopy 试用心得

PHP采集利器:Snoopy 试用心得

?

Snoopy是什么? (下载snoopy
Snoopy是一个php类,用来模仿web浏览器的功能,它能完成获取网页内容和发送表单的任务。
Snoopy的一些特点:
* 方便抓取网页的内容
* 方便抓取网页的文本内容 (去除HTML标签)
* 方便抓取网页的链接
* 支持代理主机
* 支持基本的用户名/密码验证
* 支持设置 user_agent, referer(来路), cookies 和 header content(头文件)
* 支持浏览器转向,并能控制转向深度
* 能把网页中的链接扩展成高质量的url(默认)
* 方便提交数据并且获取返回值
* 支持跟踪HTML框架(v0.92增加)
* 支持再转向的时候传递cookies (v0.92增加)
?
要想了解的更深入些,你自己Google一下吧。下面就给几个简单的例子:
1获取指定url内容
PHP代码
$url = "http://www.taoav.com";   
include("snoopy.php");   
$snoopy = new Snoopy;   
$snoopy->fetch($url); //获取所有内容   
echo $snoopy->results; //显示结果   
$snoopy->fetchtext //获取文本内容(去掉html代码)   
$snoopy->fetchlinks //获取链接   
$snoopy->fetchform //获取表单   
2 表单提交
PHP代码
$formvars["username"] = "admin";   
$formvars["pwd"] = "admin";   
$action = "http://www.taoav.com";//表单提交地址   
$snoopy->submit($action,$formvars);//$formvars为提交的数组   
echo $snoopy->results; //获取表单提交后的 返回的结果     
$snoopy->submittext; //提交后只返回 去除html的 文本   
$snoopy->submitlinks;//提交后只返回 链接   
?既然已经提交的表单 那就可以做很多事情 接下来我们来伪装ip,伪装浏览器
3 伪装
PHP代码
$formvars["username"] = "admin";   
$formvars["pwd"] = "admin";   
$action = "http://www.taoav.com";   
include "snoopy.php";   
$snoopy = new Snoopy;   
$snoopy->cookies["PHPSESSID"] = 'fc106b1918bd522cc863f36890e6fff7'; //伪装sessionid   
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)"; //伪装浏览器   
$snoopy->referer = "http://www.only4.cn"; //伪装来源页地址 http_referer   
$snoopy->rawheaders["Pragma"] = "no-cache"; //cache 的http头信息   
$snoopy->rawheaders["X_FORWARDED_FOR"] = "127.0.0.101"; //伪装ip   
$snoopy->submit($action,$formvars);   
echo $snoopy->results; 
?

  1. 原来我们可以伪装session 伪装浏览器 ,伪装ip, haha 可以做很多事情了。
例如 带验证码,验证ip 投票, 可以不停的投。
ps:这里伪装ip ,其实是伪装http头, 所以一般的通过 REMOTE_ADDR 获取的ip是伪装不了,
反而那些通过http头来获取ip的(可以防止代理的那种) 就可以自己来制造ip。
关于如何验证码 ,简单说下:
首先用普通的浏览器, 查看页面 , 找到验证码所对应的sessionid,
同时记下sessionid和验证码值,
接下来就用snoopy去伪造 。
原理:由于是同一个sessionid 所以取得的验证码和第一次输入的是一样的。
4 有时我们可能需要伪造更多的东西,snoopy完全为我们想到了
PHP代码
$snoopy->proxy_host = "www.only4.cn";   
$snoopy->proxy_port = "8080"; //使用代理      
$snoopy->maxredirs = 2; //重定向次数    
 $snoopy->expandlinks = true; //是否补全链接 在采集的时候经常用到   
// 例如链接为 /images/taoav.gif 可改为它的全链接 http://www.taoav.com/images/taoav.gif,这个地方其实可以在最后输出的时候用ereg_replace函数自己替换 
$snoopy->maxframes = 5 //允许的最大框架数      
//注意抓取框架的时候 $snoopy->results 返回的是一个数组   
 
$snoopy->error //返回报错信息  
?上面的基本用法了解了,下面我就实例演示一次:
PHP代码?
   
//echo var_dump($_SERVER);   
include("Snoopy.class.php");    
$snoopy = new Snoopy;    
$snoopy->agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-
CN; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 FirePHP/0.2.1";//这项是浏览器信
息,前面你用什么浏览器查看cookie,就用那个浏览器的信息(ps:$_SERVER可以查看到浏览器的信息)    
$snoopy->referer = "http://bbs.phpchina.com/index.php";   
$snoopy->expandlinks = true;   
$snoopy->rawheaders["COOKIE"]="__utmz=17229162.1227682761.29.7.utmccn=(referral)|utmcsr=phpchina.com|utmcct=/html/index.html|utmcmd=referral; cdbphpchina_smile=1D2D0D1; cdbphpchina_cookietime=2592000; __utma=233700831.1562900865.1227113506.1229613449.1231233266.16; __utmz=233700831.1231233266.16.8.utmccn=(referral)|utmcsr=localhost:8080|utmcct=/test3.php|utmcmd=referral; __utma=17229162.1877703507.1227113568.1231228465.1231233160.58; uchome_loginuser=sinopf; xscdb_cookietime=2592000; __utmc=17229162; __utmb=17229162; cdbphpchina_sid=EX5w1V; __utmc=233700831; cdbphpchina_visitedfid=17; cdbphpchinaO766uPYGK6OWZaYlvHSuzJIP22VpwEMGnPQAuWCFL9Fd6CHp2e%2FKw0x4bKz0N9lGk; xscdb_auth=8106rAyhKpQL49eMs%2FyhLBf3C6ClZ%2B2idSk4bExJwbQr%2BHSZrVKgqPOttHVr%2B6KLPg3DtWpTMUI4ttqNNVpukUj6ElM; cdbphpchina_onlineusernum=3721";   
  
 
$snoopy->fetch("http://bbs.phpchina.com/forum-17-1.html"); 
$n=ereg_replace("href=\"","href=\"http://bbs.phpchina.com/",$snoopy->results );   
echo ereg_replace("src=\"","src=\"http://bbs.phpchina.com/",$n);   
?>  
?这是模拟登陆PHPCHINA论坛的过程,首先要查看自己浏览器的信
息:echo?var_dump($_SERVER);这句代码可以看到自己浏览器的信息,把?
$_SERVER['HTTP_USER_AGENT']后边的内容复制下来,粘在$snoopy->agent的地方,然后就是要查看自己的
COOKIE了,用自己在论坛的账号登陆论坛后,在浏览器地址栏里输入
javascript:document.write(document.cookie),回车,就可以看到自己的cookie信息,复制粘贴
到$snoopy->rawheaders["COOKIE"]=的后边。(我的cookie信息为了安全起见已经删除了一段内容)


然后再注意:


# $n=ereg_replace("href=\"","href=\"http://bbs.phpchina.com/",$snoopy->results );?


# echo ereg_replace("src=\"","src=\"http://bbs.phpchina.com/",$n);


这两句代码,因为采集到的内容所有的HTML源码地址都是相对链接,所以要替换成绝对链接,这样就可以引用论坛的图片和css样式了。
转载:http://zzdboy1616.blog.163.com/blog/static/430670762009213111712876/?

?

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
What are some common problems that can cause PHP sessions to fail?What are some common problems that can cause PHP sessions to fail?Apr 25, 2025 am 12:16 AM

Reasons for PHPSession failure include configuration errors, cookie issues, and session expiration. 1. Configuration error: Check and set the correct session.save_path. 2.Cookie problem: Make sure the cookie is set correctly. 3.Session expires: Adjust session.gc_maxlifetime value to extend session time.

How do you debug session-related issues in PHP?How do you debug session-related issues in PHP?Apr 25, 2025 am 12:12 AM

Methods to debug session problems in PHP include: 1. Check whether the session is started correctly; 2. Verify the delivery of the session ID; 3. Check the storage and reading of session data; 4. Check the server configuration. By outputting session ID and data, viewing session file content, etc., you can effectively diagnose and solve session-related problems.

What happens if session_start() is called multiple times?What happens if session_start() is called multiple times?Apr 25, 2025 am 12:06 AM

Multiple calls to session_start() will result in warning messages and possible data overwrites. 1) PHP will issue a warning, prompting that the session has been started. 2) It may cause unexpected overwriting of session data. 3) Use session_status() to check the session status to avoid repeated calls.

How do you configure the session lifetime in PHP?How do you configure the session lifetime in PHP?Apr 25, 2025 am 12:05 AM

Configuring the session lifecycle in PHP can be achieved by setting session.gc_maxlifetime and session.cookie_lifetime. 1) session.gc_maxlifetime controls the survival time of server-side session data, 2) session.cookie_lifetime controls the life cycle of client cookies. When set to 0, the cookie expires when the browser is closed.

What are the advantages of using a database to store sessions?What are the advantages of using a database to store sessions?Apr 24, 2025 am 12:16 AM

The main advantages of using database storage sessions include persistence, scalability, and security. 1. Persistence: Even if the server restarts, the session data can remain unchanged. 2. Scalability: Applicable to distributed systems, ensuring that session data is synchronized between multiple servers. 3. Security: The database provides encrypted storage to protect sensitive information.

How do you implement custom session handling in PHP?How do you implement custom session handling in PHP?Apr 24, 2025 am 12:16 AM

Implementing custom session processing in PHP can be done by implementing the SessionHandlerInterface interface. The specific steps include: 1) Creating a class that implements SessionHandlerInterface, such as CustomSessionHandler; 2) Rewriting methods in the interface (such as open, close, read, write, destroy, gc) to define the life cycle and storage method of session data; 3) Register a custom session processor in a PHP script and start the session. This allows data to be stored in media such as MySQL and Redis to improve performance, security and scalability.

What is a session ID?What is a session ID?Apr 24, 2025 am 12:13 AM

SessionID is a mechanism used in web applications to track user session status. 1. It is a randomly generated string used to maintain user's identity information during multiple interactions between the user and the server. 2. The server generates and sends it to the client through cookies or URL parameters to help identify and associate these requests in multiple requests of the user. 3. Generation usually uses random algorithms to ensure uniqueness and unpredictability. 4. In actual development, in-memory databases such as Redis can be used to store session data to improve performance and security.

How do you handle sessions in a stateless environment (e.g., API)?How do you handle sessions in a stateless environment (e.g., API)?Apr 24, 2025 am 12:12 AM

Managing sessions in stateless environments such as APIs can be achieved by using JWT or cookies. 1. JWT is suitable for statelessness and scalability, but it is large in size when it comes to big data. 2.Cookies are more traditional and easy to implement, but they need to be configured with caution to ensure security.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

VSCode Windows 64-bit Download

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools