Home >Backend Development >PHP Tutorial >对各种编码及压缩的页面内容抓取程序PHP

对各种编码及压缩的页面内容抓取程序PHP

WBOY
WBOYOriginal
2016-06-20 12:30:57892browse

项目中经常用到页面抓取的功能,页面抓取的过程往往会出现两种常见的问题

1、页面编码不统一,本地是utf-8,抓取的页面是gbk等,导致抓取过来的出现乱码

2、有些网站用了压缩技术,针对页面进行压缩,gzip压缩,这也导致抓取结果就异常

经过在网上搜索相关解决方案及本地测试,现整理一下相关功能,便后继使用。同时也放到了github上,地址:https://github.com/lock-upme/Spider

主程序简单说明一下:

先是用file_get_contents抓取,如果抓取不到,则再用Snoopy来抓取,最后进行编码转换。

程序见下:

/** * 抓取页面内容 *  * $obj = new spider() * $result = $obj->spider('http://www.test.com/1.html'); *  * @author lock */class Spider {		/**	 *	抓取页面内容	 *	 * @param string $url	 * @return string	 */	public function spider($url) {		set_time_limit(10);				$result = self::fileGetContents($url);		if (empty($result)) {			$result = self::snoopy($url);		}		if (empty($result)) {			return false;		}				$result = self::array_iconv($result);		if (empty($result)) {			return false;		}				$result = str_replace("\n", "", $result);			return $result;	}		/**	 * 获取页面内容	 * 	 * @param string $url	 * @return string	 */	public function fileGetContents($url) {		//只读2字节  如果为(16进制)1f 8b (10进制)31 139则开启了gzip ;		$file = @fopen($url, 'rb');		$bin = @fread($file, 2);		@fclose($file);		$strInfo = @unpack('C2chars', $bin);		$typeCode = intval($strInfo['chars1'].$strInfo['chars2']);		$url = ($typeCode==31139) ? 'compress.zlib://'.$url:$url; // 三元表达式			return @file_get_contents($url);	}		/**	 * 获取页面内容	 * 	 * @param string $url	 * @return string	 */	public function snoopy($url){		require_once 'Snoopy.class.php';		$snoopy = new Snoopy;		$snoopy->agent = 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36';		$snoopy->_fp_timeout = 10;		$urlSplit = self::urlSimplify($url);		$snoopy->referer = $urlSplit['domain'];		$result = $snoopy->fetch($url);		return  $snoopy->results; 	}		/**	 *对数据进行编码转换(来自网络)	 *	 * @param array/string $data       数组	 * @param string $output    转换后的编码	 * @return 返回编码后的数据	 */	public function array_iconv($data,  $output = 'utf-8') {		$encodeArr = array('UTF-8', 'ASCII', 'GBK', 'GB2312', 'BIG5', 'JIS', 'eucjp-win', 'sjis-win', 'EUC-JP');		$encoded = mb_detect_encoding($data, $encodeArr);			if (empty($encoded)) { $encoded='UTF-8'; }			if (!is_array($data)) {			return @mb_convert_encoding($data, $output, $encoded);		} else {			foreach ($data as $key=>$val) {				$key = self::array_iconv($key, $output);				if (is_array($val)) {					$data[$key] = self::array_iconv($val, $output);				} else {					$data[$key] = @mb_convert_encoding($data, $output, $encoded);				}			}			return $data;		}	}	}


Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn