Scraping all images under a domain with PHP regular expressions (PHP tutorial)

WBOY
2016-07-20 11:12:42

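Code source: jUnion

Supported platforms: Windows, Linux (Ubuntu), PHP 5.2.5+, Apache

Purpose: crawl an entire site and download its images. It does not yet use PHP's curl extension; that is planned for a later revision.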

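Configuration (in the config directory):
     domain_name: the domain name (default: bizhibar.com)
     request_site: the site URL (default: http://www.bizhibar.com/)
     request_url: the page to start crawling from (default: http://www.bizhibar.com/)
     accept_type: accepted image types (default: gif, bmp, png, ico, jpg, jpeg)
     save_path: directory where images are saved (default: savefiles/)
     partition_name: name prefix for the image subdirectories (default: img_)
     dir_file_limit: maximum number of files per directory (default: 100)
     serialize_img_size: each time this many image URLs have been read, the list is written to the accompImg file in the cache directory; these URLs are skipped when crawling resumes (default: 30)
     serialize_url_size: same as serialize_img_size, but for page links: each time this many links have been read, the list is written to the overURL file in the cache directory, and these links are skipped when crawling resumes (default: 10)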
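
The settings listed above could be stored as follows. This is only a sketch of what config/capture.site.php might look like, assuming it returns a plain array (Capture loads the file with require and uses the result as an array). The file itself is not shown in this article, so the array form of accept_type is an assumption, and so is the value of preform_page_time, a key the crawler code reads but the list does not document.

```php
<?php
// Hypothetical config/capture.site.php; the real file is not shown in the article.
// Capture loads it with `require`, so it has to return an array.
return array(
    'domain_name'        => 'bizhibar.com',
    'request_site'       => 'http://www.bizhibar.com/',
    'request_url'        => 'http://www.bizhibar.com/',
    // Assumption: the accepted types are stored as an array of extensions.
    'accept_type'        => array('gif', 'bmp', 'png', 'ico', 'jpg', 'jpeg'),
    'save_path'          => 'savefiles/',
    'partition_name'     => 'img_',
    'dir_file_limit'     => 100,
    'serialize_img_size' => 30,
    'serialize_url_size' => 10,
    // Assumption: read by Capture::roundTagA() as a delay in seconds;
    // the value 1 is a guess, not a documented default.
    'preform_page_time'  => 1,
);
```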

Note: criticism and suggestions are welcome. If you find a problem or something that could be improved, please send me your feedback.

<?php
/* a full-site crawl can run for a long time, so lift the execution time limit */
set_time_limit(0);
require dirname(__FILE__).DIRECTORY_SEPARATOR.'include'.DIRECTORY_SEPARATOR.'Capture.const.php';
require __Home__.'include'.__Os__.'Capture.class.php';

/* paths of the two config files and the two cache files */
$_cfg = array(
	'site' => __Home__.'config'.__Os__.'capture.site.php',
	'preg' => __Home__.'config'.__Os__.'capture.preg.php',
	'accompImg' => __Home__.'cache'.__Os__.'accompImg',
	'overURL'   => __Home__.'cache'.__Os__.'overURL'
);

$_parse = new Capture( $_cfg );
$_parse->parseQuestUrl();

?>
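
The entry script above depends on two constants, __Home__ and __Os__, that come from include/Capture.const.php, a file this article does not show. Judging only from how the require paths are built, __Os__ appears to be the directory separator and __Home__ the project root with a trailing separator. A minimal sketch under those assumptions:

```php
<?php
// Hypothetical include/Capture.const.php (not shown in the article).
// Assumption: __Os__ is the directory separator and __Home__ is the
// project root with a trailing separator, as the require paths imply.
if (!defined('__Os__')) {
    define('__Os__', DIRECTORY_SEPARATOR);
}
if (!defined('__Home__')) {
    // Assumption: this file lives in <root>/include/, so go up one level.
    define('__Home__', dirname(dirname(__FILE__)) . __Os__);
}
```

dirname(__FILE__) is used instead of __DIR__ because the stated minimum version is PHP 5.2.5, and __DIR__ only exists from PHP 5.3.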
<?php
/**
 * The main class (include/Capture.class.php)
 * @author pankai<530911044@qq.com>
 * @date 2013-08-10
 */
class Capture {
	/* paths of the config and cache files, as passed to the constructor */
	private static $_Config = array();
	
	/* arrays returned by capture.site.php and capture.preg.php */
	private static $_CapSite = NULL;
	private static $_CapPreg = NULL;
	
	/* URLs of the pages that have already been visited */
	private static $_overURL = array();
	
	/* TRUE once Pivotal::parseUrl() has returned for the current page */
	private $_mark = FALSE;
	/* multiplier for the delay between page requests */
	private static $_markTime = 1;
	/**
	 * initialize the main class: Capture
	 * @param $_cfg array
	 */
	public function __construct( &$_cfg ) {
		self::$_Config = &$_cfg;
		
		self::$_CapSite = require $_cfg['site'];
		self::$_CapPreg = require $_cfg['preg'];
		
		/* substitute the _request_site placeholder in every pattern with the configured site URL */
		foreach( self::$_CapPreg as $_key => $_value ) {
			self::$_CapPreg[$_key] = str_replace( '_request_site', self::$_CapSite['request_site'], $_value );
		}
		
		/* restore the visited links cached by a previous run */
		self::import( 'file.OperateFile' );
		if( file_exists( $_cfg['overURL'] ) && filesize( $_cfg['overURL'] ) > 0 ) {
			$_contents = OperateFile::readText( $_cfg['overURL'], filesize( $_cfg['overURL'] ) );
			$_cached = unserialize( $_contents );
			/* unserialize() returns FALSE for a damaged cache file */
			if( is_array( $_cached ) ) {
				self::$_overURL = $_cached;
			}
		}
		
		/* restore the image URLs cached by a previous run */
		self::import('pivotal.Pivotal');
		if( file_exists( $_cfg['accompImg'] ) && filesize( $_cfg['accompImg'] ) > 0 ) {
			$_contents = OperateFile::readText( $_cfg['accompImg'], filesize( $_cfg['accompImg'] ) );
			$_cached = unserialize( $_contents );
			if( is_array( $_cached ) ) {
				Pivotal::$_accompImg = $_cached;
			}
		}
		
	}
	/**
	 * load a class, in the style of a Java package import: import com.jUnion.Capture
	 * @param $_class dotted class path, e.g. 'file.OperateFile'
	 */
	public static function import( $_class ) {
		require_once __Home__.'include'.__Os__.str_replace( '.', __Os__, $_class ).'.class.php';
	}
	
	/**
	 * create an instance of the Pivotal class and parse one page
	 * @param $_source content of the page
	 * @return the value returned by Pivotal::parseUrl() (the links to follow next)
	 */
	private function getCapInstance( &$_source ) {
		$this->_mark = FALSE;
		
		$_Captal = new Pivotal( self::$_Config, $_source );
		$_tagA = $_Captal->parseUrl();
		
		$this->_mark = TRUE;
		
		return $_tagA;
	}
	
	/**
	 * go forward one by one: visit every link and recurse into the links it yields
	 * @param $_tagArr array of links, possibly nested
	 */
	private function roundTagA( &$_tagArr ) {
		if( $_tagArr == NULL ) {
			return;
		}
		$_tagArrLength = count( $_tagArr );
		for( $i = 0; $i < $_tagArrLength; $i ++ ) {
			if( is_array( $_tagArr[ $i ] ) ) {
				$this->roundTagA( $_tagArr[ $i ] );
			}
			else {
				/* skip links that point outside the configured domain */
				if( stripos( $_tagArr[$i], self::$_CapSite['domain_name'] ) === FALSE ) {
					continue;
				}
				/* skip links that have already been visited */
				if( in_array( $_tagArr[$i], self::$_overURL ) ) {
					continue;
				}
				self::$_overURL[] = $_tagArr[$i];
				/* write the visited list to the cache every serialize_url_size links */
				if( count( self::$_overURL ) % self::$_CapSite['serialize_url_size'] == 0 ) {
					OperateFile::setText( self::$_Config['overURL'], serialize( self::$_overURL ) );
				}
				do {
					/* getCapInstance() takes its argument by reference, so it needs a variable */
					$_source = Http::get( $_tagArr[$i] );
					$_tagA = $this->getCapInstance( $_source );
					sleep( self::$_CapSite['preform_page_time'] * self::$_markTime );
					if( $this->_mark === TRUE ) {
						/* parsed: reset the multiplier (to preform_page_time, although its initial value is 1) */
						self::$_markTime = self::$_CapSite['preform_page_time'];
						break;
					}
					/* not parsed: double the delay and try again */
					self::$_markTime *= 2;
				} while( true );
				/* parse the main page and return next page */
				$this->roundTagA( $_tagA );
			}
		}
	}
	/**
	 * entry point: fetch request_url and follow the links found on it
	 */
	public function parseQuestUrl() {
		self::import('http.Http');
		$_source = Http::get( self::$_CapSite['request_url'] );
		$_round_Arr = $this->getCapInstance( $_source );
		$this->roundTagA( $_round_Arr );
	}
}

?>
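
The class above leaves the actual pattern matching to the Pivotal class and to config/capture.preg.php, neither of which is shown in this article. As a rough illustration of the regular-expression step only, the sketch below pulls image URLs and same-domain links out of an HTML string. The function names extractImageUrls and extractSameDomainLinks, the patterns and the sample HTML are all hypothetical and are not taken from jUnion.

```php
<?php
/**
 * Hypothetical helpers: the article's Pivotal class and capture.preg.php
 * are not shown, so these patterns are illustrative only.
 */

// Return the unique <img src> values whose extension is in $acceptTypes.
function extractImageUrls($html, array $acceptTypes)
{
    $quoted = array();
    foreach ($acceptTypes as $type) {
        $quoted[] = preg_quote($type, '~');
    }
    // Matches src="..." or src='...' ending in one of the accepted extensions.
    $pattern = '~<img\b[^>]*\bsrc\s*=\s*["\']([^"\']+\.(?:'
             . implode('|', $quoted) . '))["\']~i';
    preg_match_all($pattern, $html, $matches);
    return array_values(array_unique($matches[1]));
}

// Return the unique <a href> values that contain $domainName, using the
// same case-insensitive stripos() test that roundTagA() applies.
function extractSameDomainLinks($html, $domainName)
{
    preg_match_all('~<a\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\']~i', $html, $matches);
    $links = array();
    foreach (array_unique($matches[1]) as $href) {
        if (stripos($href, $domainName) !== false) {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up sample page for demonstration.
$html = '<a href="http://www.bizhibar.com/page2.html">next</a>'
      . '<a href="http://other.example/">elsewhere</a>'
      . '<img src="http://www.bizhibar.com/a/1.jpg">'
      . '<img class="x" src="/b/2.PNG" alt="">'
      . '<img src="/c/3.svg">';

print_r(extractImageUrls($html, array('gif', 'png', 'jpg', 'jpeg')));
print_r(extractSameDomainLinks($html, 'bizhibar.com'));
```

Regular expressions are fragile on real-world HTML (unquoted attributes, URLs with query strings, lazy-loading attributes and so on), so patterns like these would need adjusting for each target site.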
