PHP scraping class Snoopy: an image-fetching example

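After two days with PHP's Snoopy class, I find it very handy. To get every link on a requested page, just call fetchlinks; to get all of the text, use fetchtext (internally it still relies on regular expressions). It has plenty of other features as well, such as simulating form submission.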


Usage:

First download the Snoopy class from http://sourceforge.net/projects/snoopy/
Then instantiate an object and call the appropriate method to get the fetched page data.

The code is as follows:
include 'snoopy/Snoopy.class.php';
   
$snoopy = new Snoopy();
   
$sourceURL = "http://www.bitsCN.com";
$snoopy->fetchlinks($sourceURL);
   
$a = $snoopy->results;

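Snoopy does not provide a method for getting the URLs of all images on a page, and I needed to collect the image URLs from every article in a page's article list. So I wrote one myself; the regular expression is the part that matters most.
The code is as follows: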
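// Regular expression that matches images.
// NOTE: the original pattern was lost when the page was extracted. This is a
// reconstruction inferred from how the capture groups are used later
// (group 1: URL without extension, group 2: extension), not the author's text.
 $reTag = '/<img[^>]+src="(http:\/\/[^"]+)\.(jpg|png|gif|jpeg)"[^>]*>/i';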


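Because my requirement is rather specific, the pattern only matches images whose URL starts with a hard-coded http:// (images on external sites may be hotlink-protected, so I want to download them locally first).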
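As a rough illustration of the kind of pattern described here, the sketch below pulls `http://` image URLs out of an HTML string with `preg_match_all`. The pattern itself, the function name `extractImageUrls`, and the sample markup are assumptions made for this example, not the author's original expression.

```php
<?php
// Hypothetical helper: returns [urlWithoutExtension, extension] pairs for
// every <img> whose src starts with http:// and ends in a known extension.
function extractImageUrls(string $html): array
{
    // Assumed pattern: group 1 = URL up to the last dot, group 2 = extension.
    $pattern = '/<img[^>]+src="(http:\/\/[^"]+)\.(jpg|jpeg|png|gif)"[^>]*>/i';

    $pairs = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $pairs[] = [$match[1], strtolower($match[2])];
        }
    }
    return $pairs;
}

// Sample markup: two absolute http:// images and one site-relative image.
$html = '<p><img alt="a" src="http://example.com/img/101.jpg" />'
      . '<img src="/local/5.gif"><img src="http://example.com/e/7.GIF"></p>';

print_r(extractImageUrls($html));
```

The site-relative image is skipped because its `src` does not begin with `http://`. Splitting the URL into a name and an extension keeps the two values that the saving step needs separately.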

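1. Fetch the given page and filter out the URLs of all the expected articles.

2. Loop over the article URLs from step 1, fetch each one, and apply the image regular expression to collect every image URL on the page that fits the rule.

3. Save each image according to its extension and ID (only gif and jpg occur here). If the image file already exists, delete it first and then save.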

The code is as follows:

<?php
    include 'snoopy/Snoopy.class.php';

    $snoopy = new Snoopy();

    $sourceURL = "http://xxxxx";
    $snoopy->fetchlinks($sourceURL);

    $a = $snoopy->results;
    // Article pages end in "<digits>.html"
    $re = '/\d+\.html$/';

    // Keep only the links that match the article pattern
    foreach ($a as $tmp) {
        if (preg_match($re, $tmp)) {
            getImgURL($tmp);
        }
    }

    function getImgURL($siteName) {
        $snoopy = new Snoopy();
        $snoopy->fetch($siteName);

        $fileContent = $snoopy->results;

        // Regular expression that matches images.
        // NOTE: the original pattern was lost when the page was extracted.
        // This is a reconstruction inferred from how the capture groups are
        // used below (group 1: URL without extension, group 2: extension).
        $reTag = '/<img[^>]+src="(http:\/\/[^"]+)\.(jpg|png|gif|jpeg)"[^>]*>/i';

        if (preg_match_all($reTag, $fileContent, $matchResult)) {
            for ($i = 0, $len = count($matchResult[1]); $i < $len; $i++) {
                saveImgURL($matchResult[1][$i], $matchResult[2][$i]);
            }
        }
    }

    function saveImgURL($name, $suffix) {
        $url = $name.".".$suffix;

        echo "Requested image URL: ".$url."\n";

        $imgSavePath = "E:/xxx/style/images/";
        // The image ID is the run of digits after the last slash
        $imgId = preg_replace('/^.+\/(\d+)$/', '\1', $name);
        if ($suffix == "gif") {
            $imgSavePath .= "emotion";
        } else {
            $imgSavePath .= "topic";
        }
        $imgSavePath .= ("/".$imgId.".".$suffix);

        // Delete an existing file before saving the new copy
        if (is_file($imgSavePath)) {
            unlink($imgSavePath);
            echo "\n\nFile ".$imgSavePath." already existed and was deleted\n\n";
        }

        $imgFile = file_get_contents($url);
        $flag = file_put_contents($imgSavePath, $imgFile);

        if ($flag) {
            echo "\n\nFile ".$imgSavePath." saved successfully\n\n";
        }
    }
?>
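
The two pure-logic pieces of the script above, picking out article links and deriving the local file name, can be tried without any network access. The following is a minimal sketch; the function names `filterArticleLinks` and `buildSavePath`, the sample URLs, and the base directory are assumptions for illustration.

```php
<?php
// Hypothetical helper: keep only links that end in "<digits>.html".
function filterArticleLinks(array $links): array
{
    return array_values(preg_grep('/\d+\.html$/', $links));
}

// Hypothetical helper: map an image URL (without extension) and its
// extension to a local path, using the trailing numeric ID as the file name.
// Returns null when the URL does not end in a numeric ID.
function buildSavePath(string $name, string $suffix, string $baseDir): ?string
{
    if (!preg_match('/\/(\d+)$/', $name, $m)) {
        return null;
    }
    // As in the script above: gif files go to "emotion", the rest to "topic".
    $subDir = ($suffix === 'gif') ? 'emotion' : 'topic';
    return rtrim($baseDir, '/') . '/' . $subDir . '/' . $m[1] . '.' . $suffix;
}

$links = ['http://example.com/a/12.html', 'http://example.com/about.html'];
print_r(filterArticleLinks($links));
echo buildSavePath('http://example.com/img/42', 'gif', '/tmp/images/'), "\n";
```

Returning null for a URL without a numeric ID is a choice made for this sketch. The original `preg_replace` call returns the input unchanged in that case, which would produce an odd file name.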

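When scraping web pages with PHP (content, images, links), I think the most important part is still the regular expressions, which pull the data you want out of the fetched content according to your rules. The overall approach is fairly simple and only a handful of functions are involved (and the fetching itself just calls methods from a class someone else already wrote).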

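On a separate note, something I had wondered about earlier: PHP does not seem to offer a built-in way to do the following. Suppose a file has N lines (N very large) and the lines matching some rule need to be replaced, for example line 3 is aaa and must become bbbbb. The usual ways to modify a file are:

1. Read the whole file at once (or line by line), write the converted result to a temporary file, then replace the original file with it.

2. Read line by line, position the file pointer with fseek, then write with fwrite.

With option 1, reading the whole file at once is not viable when the file is large (and reading line by line into a temporary file, then replacing the original, does not feel very efficient either). Option 2 is fine as long as the new string is no longer than the text it overwrites (a shorter one leaves the remaining old bytes in place), but a longer one "overruns" and corrupts the data of the next line. There is nothing like the "selection" concept in JavaScript, where a selected range is replaced with new content.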
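
For comparison, the line-by-line form of option 1 might look like the sketch below. The function name `rewriteLines`, its callback signature, and the demo file are assumptions for illustration; only one line is held in memory at a time.

```php
<?php
// Hypothetical helper: rewrite $path line by line through a temporary file.
// $transform receives each line (without its line ending) and its 1-based
// line number; returning null drops the line.
function rewriteLines(string $path, callable $transform): bool
{
    $in = fopen($path, 'r');
    if ($in === false) {
        return false;
    }
    // Create the temp file next to the original so rename() stays on one
    // filesystem.
    $tmpPath = tempnam(dirname($path), 'rw_');
    $out = ($tmpPath === false) ? false : fopen($tmpPath, 'w');
    if ($out === false) {
        fclose($in);
        return false;
    }

    $lineNo = 0;
    while (($line = fgets($in)) !== false) {
        $lineNo++;
        $result = $transform(rtrim($line, "\r\n"), $lineNo);
        if ($result !== null) {
            fwrite($out, $result . "\n");
        }
    }
    fclose($in);
    fclose($out);

    // Replace the original with the rewritten copy.
    return rename($tmpPath, $path);
}

// Demo: turn line 3 from "aaa" into "bbbbb".
$file = tempnam(sys_get_temp_dir(), 'demo_');
file_put_contents($file, "x\ny\naaa\nz\n");
rewriteLines($file, function (string $line, int $n) {
    return ($n === 3 && $line === 'aaa') ? 'bbbbb' : $line;
});
echo file_get_contents($file);
unlink($file);
```

This sketch normalizes line endings to `\n` and adds a trailing newline to a final line that lacked one. It copies the whole file once, which is the cost the paragraph above refers to, but the new line can be any length.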

The code below is an experiment with option 2:
<?php
$mode = "r+";
$filename = "d:/file.txt";
$fp = fopen($filename, $mode);
if ($fp) {
  $i = 1;
  while (!feof($fp)) {
    $str = fgets($fp);
    echo $str;
    if ($i == 1) {
      $len = strlen($str);
      fseek($fp, -$len, SEEK_CUR); // move the pointer back to the start of the line just read
      fwrite($fp, "123");          // overwrite the first bytes of that line
    }
    $i++;
  }
  fclose($fp);
}
?>

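The code reads one line first, at which point the file pointer is at the start of the next line. fseek moves the pointer back to the start of the line just read, and fwrite then performs the replacement. Precisely because this is an overwrite with no length specified, it can spill into the data of the next line. What I want is to operate on this one line only, for example to delete it or to replace the whole line with a single 1. The example above cannot do that; perhaps I just have not found the right method yet…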
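
The overwrite behavior described above can be reproduced on a small temporary file. This is a minimal sketch; the helper name `overwriteAt` and the two-line sample content are assumptions for illustration.

```php
<?php
// Hypothetical helper: write $data at byte $offset of $path in "r+" mode
// and return the resulting file content.
function overwriteAt(string $path, int $offset, string $data): string
{
    $fp = fopen($path, 'r+');
    fseek($fp, $offset, SEEK_SET);
    fwrite($fp, $data); // overwrites strlen($data) bytes; never inserts or deletes
    fclose($fp);
    return file_get_contents($path);
}

$file = tempnam(sys_get_temp_dir(), 'ow_');

// Case 1: the new text is shorter than line 1.
file_put_contents($file, "aaa\nbbb\n");
var_dump(overwriteAt($file, 0, "1"));

// Case 2: the new text is longer than line 1.
file_put_contents($file, "aaa\nbbb\n");
var_dump(overwriteAt($file, 0, "12345"));

unlink($file);
```

In the first case the file becomes `1aa` followed by `bbb`: the rest of the old line is still there. In the second it becomes the single line `12345bb`, because the write ran over the line break and into the next line. fwrite only replaces bytes in place, so changing the length of a line means rewriting everything after it.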
