Crawling All Images Under a Domain with PHP Regular Expressions

WBOY (original post)
2016-07-20 11:12:42

Code source: jUnion

Supported platforms: Windows, Linux (Ubuntu), PHP 5.2.5+, Apache

Function: crawls every image on an entire site. It does not yet use PHP's curl extension; that is planned for a later version.

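Configuration: in the config directory
     domain_name: domain name (default: bizhibar.com)
     request_site: site URL (default: http://www.bizhibar.com/)
     request_url: page at which the crawl starts (default: http://www.bizhibar.com/)
     accept_type: image types (default: gif, bmp, png, ico, jpg, jpeg)
     save_path: directory where images are saved (default: savefiles/)
     partition_name: prefix for the names of the image directories (default: img_)
     dir_file_limit: maximum number of files per directory (default: 100)
     serialize_img_size: number of image URLs read before they are cached to the accompImg file in the cache directory; these URLs are skipped when the crawl resumes (default: 30)
     serialize_url_size: like serialize_img_size, the number of page links read before they are cached to the overURL file in the cache directory; these URLs are skipped when the crawl resumes (default: 10)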
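
To make the options above concrete, here is a rough sketch of what config/capture.site.php could contain, using the listed defaults. The variable name $siteConfig, the array form of accept_type, and the value of preform_page_time (a key that Capture::roundTagA() reads but the list does not document) are assumptions, not taken from the original project.

```php
<?php
// Hypothetical contents of config/capture.site.php, using the documented defaults.
// The real file would return the array (`return $siteConfig;`), because Capture
// assigns the result of `require` to its site settings.
$siteConfig = array(
    'domain_name'        => 'bizhibar.com',
    'request_site'       => 'http://www.bizhibar.com/',
    'request_url'        => 'http://www.bizhibar.com/',
    // Assumption: stored as an array; the original may use a comma-separated string.
    'accept_type'        => array('gif', 'bmp', 'png', 'ico', 'jpg', 'jpeg'),
    'save_path'          => 'savefiles/',
    'partition_name'     => 'img_',
    'dir_file_limit'     => 100,
    'serialize_img_size' => 30,
    'serialize_url_size' => 10,
    // Assumption: seconds to pause between pages; the default is not documented.
    'preform_page_time'  => 1,
);
```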

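The save_path, partition_name and dir_file_limit options together suggest that images are spread over numbered directories. The class that does this is not shown, so the following is only a minimal sketch of one way it might work; the function name partitionDir and the zero-based numbering are assumptions.

```php
<?php
/**
 * Map the zero-based index of a saved image to its target directory.
 * Hypothetical helper: the original saving logic is not part of this article.
 */
function partitionDir($savePath, $prefix, $dirFileLimit, $imageIndex)
{
    // Integer division: images 0..99 go to img_0, 100..199 to img_1, and so on.
    $partition = (int) floor($imageIndex / $dirFileLimit);
    return rtrim($savePath, '/\\') . '/' . $prefix . $partition;
}

echo partitionDir('savefiles/', 'img_', 100, 0), "\n";   // savefiles/img_0
echo partitionDir('savefiles/', 'img_', 100, 250), "\n"; // savefiles/img_2
```

Keeping the number of files per directory bounded avoids very large directories, which some file systems and file managers handle slowly.
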
Note: criticism and suggestions are welcome. If you run into a problem or see something that could be improved, please let me know.

<?php
// Entry script. A full crawl can run for a long time, so lift the execution time limit.
set_time_limit(0);
require dirname(__FILE__).DIRECTORY_SEPARATOR.'include'.DIRECTORY_SEPARATOR.'Capture.const.php';
require __Home__.'include'.__Os__.'Capture.class.php';

// Paths to the site settings, the regex patterns, and the two resume caches.
$_cfg = array(
	'site' => __Home__.'config'.__Os__.'capture.site.php',
	'preg' => __Home__.'config'.__Os__.'capture.preg.php',
	'accompImg' => __Home__.'cache'.__Os__.'accompImg',
	'overURL'   => __Home__.'cache'.__Os__.'overURL'
);

$_parse = new Capture( $_cfg );
$_parse->parseQuestUrl();

?>
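
The entry script relies on two constants, __Home__ and __Os__, defined in include/Capture.const.php, which is not shown. Judging from how they are used, __Os__ is probably the directory separator and __Home__ the project root with a trailing separator. The sketch below works under that assumption and also shows how a dotted name such as file.OperateFile would resolve to a file path in Capture::import(); the helper classPath is hypothetical.

```php
<?php
// Hypothetical sketch of include/Capture.const.php (assumed, not the original file).
if (!defined('__Os__')) {
    define('__Os__', DIRECTORY_SEPARATOR);
}
if (!defined('__Home__')) {
    // This file would sit in include/, so the project root is one level up.
    define('__Home__', dirname(dirname(__FILE__)) . __Os__);
}

/**
 * Resolve a dotted class name the same way Capture::import() builds its path,
 * e.g. "file.OperateFile" -> <root>/include/file/OperateFile.class.php
 */
function classPath($class)
{
    return __Home__ . 'include' . __Os__ . str_replace('.', __Os__, $class) . '.class.php';
}
```
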
<?php
/**
 * The main class
 * @author pankai<530911044@qq.com>
 * @date 2013-08-10
 */
class Capture {
	private static $_Config = array();
	
	private static $_CapSite = NULL;
	private static $_CapPreg = NULL;
	
	private static $_overURL = array();
	
	private $_mark = FALSE;
	private static $_markTime = 1;
	/**
	 * initialize the main class: Capture
	 * @param $_cfg array
	 */
	public function __construct( &$_cfg ) {
		self::$_Config = &$_cfg;
		
		// Both config files return an array when required.
		self::$_CapSite = require $_cfg['site'];
		self::$_CapPreg = require $_cfg['preg'];
		
		// Replace the _request_site placeholder in every pattern with the configured site URL.
		foreach( self::$_CapPreg as $_key => $_value ) {
			self::$_CapPreg[$_key] = str_replace( '_request_site', self::$_CapSite['request_site'], $_value );
		}
		
		// Restore the list of already crawled page URLs, if a cache file exists.
		self::import( 'file.OperateFile' );
		if( file_exists( $_cfg['overURL'] ) && filesize( $_cfg['overURL'] ) > 0 ) {
			$_contents = OperateFile::readText( $_cfg['overURL'], filesize( $_cfg['overURL'] ) );
			self::$_overURL = unserialize( $_contents );
		}
		
		// Restore the list of already downloaded image URLs, if a cache file exists.
		self::import('pivotal.Pivotal');
		if( file_exists( $_cfg['accompImg'] ) && filesize( $_cfg['accompImg'] ) > 0 ) {
			$_contents = OperateFile::readText( $_cfg['accompImg'], filesize( $_cfg['accompImg'] ) );
			Pivotal::$_accompImg = unserialize( $_contents );
		}
		
	}
	/**
	 * load a class, following the Java package convention: import com.jUnion.Capture
	 * @param $_class dotted class name, e.g. file.OperateFile
	 */
	public static function import( $_class ) {
		require_once __Home__.'include'.__Os__.str_replace( '.', __Os__, $_class ).'.class.php';
	}
	
	/**
	 * create an instance of Pivotal class
	 * @param $_source page source; taken by value because callers pass the result of Http::get() directly
	 */
	private function getCapInstance( $_source ) {
		$this->_mark = FALSE;
		
		$_Captal = new Pivotal( self::$_Config, $_source );
		$_tagA = $_Captal->parseUrl();
		
		$this->_mark = TRUE;
		
		return $_tagA;
	}
	
	/**
	 * go forward one by one
	 * @param $_tagArr
	 */
	private function roundTagA( &$_tagArr ) {
		if( $_tagArr == NULL ) {
			return;
		}
		$_tagArrLength = count( $_tagArr );
		for( $i = 0; $i < $_tagArrLength; $i ++ ) {
			if( is_array( $_tagArr[ $i ] ) ) {
				$this->roundTagA( $_tagArr[ $i ] );  
			}
			else {
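				// Skip links that point outside the configured domain.
				if( stripos( $_tagArr[$i], self::$_CapSite['domain_name'] ) === FALSE ) {
					continue;
				}
				// Skip pages that have already been crawled.
				if( in_array( $_tagArr[$i], self::$_overURL ) ) {
					continue;
				}
				self::$_overURL[] = $_tagArr[$i];
				// Persist the visited list every serialize_url_size entries so an interrupted crawl can resume.
				if( count( self::$_overURL ) % self::$_CapSite['serialize_url_size'] == 0 ) {
					OperateFile::setText( self::$_Config['overURL'], serialize( self::$_overURL ) );
				}
				do {
					$_tagA = $this->getCapInstance( Http::get( $_tagArr[$i] ) );
					// Pause between pages; the multiplier doubles after each attempt that did not complete.
					sleep( self::$_CapSite['preform_page_time'] * self::$_markTime );
					if( $this->_mark === TRUE ) {
						// Note: this stores preform_page_time itself as the next multiplier, not 1.
						self::$_markTime = self::$_CapSite['preform_page_time'];
						break;
					}
					self::$_markTime *= 2;
				} while( true );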
				/* parse the main page and return next page */
				$this->roundTagA( $_tagA );
			}
		}
	}
	public function parseQuestUrl() {
		self::import('http.Http');
		$_round_Arr = $this->getCapInstance( Http::get( self::$_CapSite['request_url'] ) );
		$this->roundTagA( $_round_Arr ); 
	}
}

?>
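
The regular expressions themselves live in config/capture.preg.php and the Pivotal class, neither of which is shown here. As a rough illustration of the technique the title refers to, the sketch below pulls image URLs out of img tags with preg_match_all and keeps only the accepted file types. The function name extractImageUrls and the pattern are assumptions, not the original patterns; relative URLs are returned as found and are not resolved against the site URL.

```php
<?php
/**
 * Collect the src values of <img> tags whose extension is in $acceptTypes.
 * Hypothetical helper; not the pattern used by the original Pivotal class.
 */
function extractImageUrls($html, array $acceptTypes)
{
    // Build an alternation such as "gif|bmp|png" from the accepted types.
    $types = implode('|', array_map('preg_quote', $acceptTypes));

    // Match src="..." or src='...' inside an <img> tag, case-insensitively,
    // and require the value to end in one of the accepted extensions.
    $pattern = '~<img\b[^>]*?\bsrc\s*=\s*["\']([^"\'>]+\.(?:' . $types . '))["\']~i';

    preg_match_all($pattern, $html, $matches);

    // Drop duplicates and reindex the result from 0.
    return array_values(array_unique($matches[1]));
}
```

A regular expression is adequate for simple, well-formed markup like this; for irregular HTML, a DOM parser is usually the more robust choice.
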

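The resume behaviour in Capture (load the serialized overURL list at start-up, skip known URLs, flush the list every serialize_url_size entries) can be sketched in isolation as follows. OperateFile is not shown in this article, so the sketch uses PHP's built-in file functions instead; loadVisited and markVisited are hypothetical names.

```php
<?php
/**
 * Load a serialized list of URLs from $file, or return an empty array.
 */
function loadVisited($file)
{
    // filesize() results are cached by PHP; clear the cache to see recent writes.
    clearstatcache();
    if (!file_exists($file) || filesize($file) === 0) {
        return array();
    }
    $list = unserialize(file_get_contents($file));
    return is_array($list) ? $list : array();
}

/**
 * Record $url and flush the whole list to $file every $flushEvery entries.
 * Returns false when the URL was already in the list.
 */
function markVisited(array &$visited, $url, $file, $flushEvery)
{
    if (in_array($url, $visited, true)) {
        return false;
    }
    $visited[] = $url;
    if (count($visited) % $flushEvery === 0) {
        file_put_contents($file, serialize($visited), LOCK_EX);
    }
    return true;
}
```

With this scheme, URLs recorded after the last flush are lost if the crawl is interrupted, so a smaller flush interval trades extra disk writes for less repeated work on resume.
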