Home >php教程 >php手册 >网络爬虫脚本

网络爬虫脚本

WBOY
WBOYOriginal
2016-06-06 20:13:341532browse

最近需要写个脚本程序抓取一些网络数据,于是就有了常见的php脚本;测试代码如下: #!/usr/local/bin/php -q?php/** * Created by PhpStorm. * User: jackqqxu * Date: 14-9-12 * Time: 上午12:34 * 解析一个目录下面的文件,分析所有的静态资源然后下载下来

最近需要写个脚本程序抓取一些网络数据,于是就有了常见的php脚本;测试代码如下:

#!/usr/local/bin/php -q
<?php /**
 * Created by PhpStorm.
 * User: jackqqxu
 * Date: 14-9-12
 * Time: 上午12:34
 *  解析一个目录下面的文件,分析所有的静态资源然后下载下来;
 */
//echo "请输入需要提取的文件路径:\n";
//$path = fread(STDIN, 100);
//echo "程序即将读取 $path 路径下面的文件\n";
//echo "请输入需要提取的文件类型:\n";
//$type = fread(STDIN, 100);
// Open a known directory, and proceed to read its contents
//$path = '/Users/jackqqxu/Desktop/task/game/a_grain_of_truth_files/css/';
$destPath = '/Users/jackqqxu/task/aliyunsvn/health/grain/views/locations/'; //静态文件html
$sourcePath = '/Users/jackqqxu/task/aliyunsvn/health/grain/js/'; //静态文件html
//$baseUrl = 'http://www.zamolski.com/agot/resources/stylesheets/';
$netSourceUrl = 'http://www.zamolski.com/agot/views/locations/'; //现在获取位置信息
//$type = '.css';
$type = '.js';  //很多需要获取定位的位置信息;
$typeLen = strlen($type);
//echo 'r=' . realpath('/Users/jackqqxu/Desktop/task/game/a_grain_of_truth_files/css/../images/ui/frame_h.png') . "\n\n";
//echo "the programe will read the $type from the $path\n";
//if (!is_dir($destPath)) {
//    exec('mkdir -p ' . $destPath);
//}
    if ($dh = opendir($sourcePath)) {
        while (($file = readdir($dh)) !== false) {
            $fileType = filetype($sourcePath . $file);
            if ($fileType != 'file') {
                continue;
            }
//            echo 'f=' . $file . substr($file, strlen($file)-$typeLen) . "\n";
            if (substr($file, strlen($file)-$typeLen) == $type) {   //类型相同
//                echo "filename: $file : filetype: " . filetype($path . $file) . "\n";
                echo '$sourcePath . $file=' . $sourcePath . $file . "\n";
                $fileContentArr = file($sourcePath . $file);
                foreach($fileContentArr as $fileLine) {
//                    if ($fileLine =~ /url\((.*?)\)/){
//                    if (preg_match_all("/url\((.*?)\)/", $fileLine, $matches))  {   //css中通过url获取其他图片;
                    if (preg_match_all("/gotoLocation\(\"(.*?)\"\)/", $fileLine, $matches))  {   //中通过关键词获取其他文件;
//                        print_r($matches);exit;
//                        foreach($matches[1] as $matchImgUrl) {
                        foreach($matches[1] as $matchUrl) {
                            $sourceUrl = $netSourceUrl . $matchUrl . '.html';
                            echo 'n='.$sourceUrl."\n";//exit;
                            $descFile = $destPath . $matchUrl . '.html';
//                            echo 'fs=' . function_exists('realpath');
//                            echo 'ni=' . $newImgFile."\n";//exit;
//                            echo 'mkdir -p=' . dirname($newImgFile);
//                            exec('mkdir -p ' . dirname($newImgFile));
                            $ret = file_put_contents($descFile, file_get_contents($sourceUrl));
                            if ($ret) {
                                echo "文件$descFile 写入成功\n";
//                                exit;
                            }
//                            exit;
                        }
                    }
                }
            }
        }
        closedir($dh);
    }
?>


codingless|网络爬虫脚本 Tags:  

Del.icio.us
codingless|网络爬虫脚本
Facebook
codingless|网络爬虫脚本
TweetThis
codingless|网络爬虫脚本
Digg
codingless|网络爬虫脚本
StumbleUpon
codingless|网络爬虫脚本

Comments:  0 (Zero), Be the first to leave a reply!


You might be interested in this:  

  • codingless|网络爬虫脚本  Ubuntu 安装JRE7的快捷方法(验证有效)
  • codingless|网络爬虫脚本  BigPipe的技术实现【转】
  • codingless|网络爬虫脚本  'insertCell' called on an object that does not implement interface HTMLTableRowElement.
  • codingless|网络爬虫脚本  javascript性能优化-repaint和reflow
  • codingless|网络爬虫脚本  Fiddler工作原理

Copyright © web代码网 [网络爬虫脚本], All Right Reserved. 2014.
Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn