
The fastest way to get the width and height of every image on a web page

WBOY
Original
2016-06-23 13:46:38 | 967 views

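Has anyone tried http://pinterest.com? After signing up there is an "add a pin" feature: you submit a site URL, press Find Images, and it finds all the images on the submitted page (filtering them by width and height). The whole process usually takes about 10 seconds.

I have been trying to imitate it as a small component. I already dropped the dreaded getimagesize() (48.64 seconds) in favor of imagecreatefromstring() (still 26.13 seconds), which is nowhere near its roughly 10 seconds.

I need to keep the number of TCP connections in mind, use as few server resources as possible, and minimize execution time. How can I optimize this code further so it runs faster?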

function ranger($url) {
    // curl setup: request only roughly the first 32 KB of each image
    $headers = array("Range: bytes=0-32768");
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($curl);
    curl_close($curl); // release the handle before returning
    return $data;
}

// simple_html_dom.php is used to parse the HTML nodes
require dirname(__FILE__) . '/simple_html_dom.php';

$url = 'http://www.huffingtonpost.com/';
$html = file_get_html($url);
if ($html->find('img')) {
    foreach ($html->find('img') as $element) {
        $raw = ranger($element->src);
        $im = @imagecreatefromstring($raw);
        $width = @imagesx($im);
        $height = @imagesy($im);
        if ($width >= 200 || $height >= 200) {
            echo $element; // print images whose width or height is at least 200
        }
    }
}

 


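Replies and discussion (solutions)

Perhaps a detour would ease the network load on the server.
The server only parses the HTML, collects the information from the image tags, and sends the collected text data back to the client.
The client loads the images; once an image has loaded, reading its width and height properties gives its original size.
There are many benefits, though hotlink protection may get in the way.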
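
The server half of this suggestion could look roughly like the sketch below. It assumes PHP's bundled DOM extension instead of simple_html_dom; the function name listImageUrls and the de-duplication step are illustrative choices, not something the poster specified.

```php
<?php
// Hypothetical helper: collect the src of every <img> in an HTML string
// so the list can be returned to the client as JSON.
function listImageUrls(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    // Assumption: repeated spacer images only need to be reported once.
    return array_values(array_unique($urls));
}

// Possible usage in an endpoint:
// header('Content-Type: application/json');
// echo json_encode(['images' => listImageUrls($pageHtml)]);
```

The server never downloads an image in this arrangement; the browser does, so the server-side cost is one page fetch plus parsing.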

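Fetch and parse: 2.8 s
Read images (138): 27 s
Found: 7

Optimizing the code alone probably won't gain much.
Consider running the requests concurrently.

Fetch and parse: 3.6 s
Launch image-reading processes (138): 1.3 s
Records in the result file: 7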

http://s.huffpost.com/images/v/logos/v4/tagline.gif
http://s.huffpost.com/images/v/logos/v4/homepage.gif?v9
http://i.huffpost.com/gen/559399/thumbs/r-OLBERMANN-huge.jpg
http://s.huffpost.com/images/facebook_promo_connect.png?3
http://images.huffingtonpost.com/2012-04-04-michaeljfoxmarlo2SECOND.jpg
http://images.huffingtonpost.com/2012-04-05-Screenshot20120405at9.40.24AM.jpg
http://i.huffpost.com/gen/557914/thumbs/s-SCORSESE-large300.jpg


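The original loop becomes:
    foreach ($html->find('img') as $element) {
        // one fire-and-forget request per image; urlencode keeps the image URL intact as a query value
        tenor('tenorcall.php?v=' . urlencode($element->src));
    }
}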


tenorcall.php
function ranger($url) {
    // curl setup: request only roughly the first 32 KB of the image
    $headers = array("Range: bytes=0-32768");
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($curl);
    curl_close($curl); // release the handle before returning
    return $data;
}

$raw = ranger($_GET['v']);
$im = @imagecreatefromstring($raw);
$width = @imagesx($im);
$height = @imagesy($im);
if ($width >= 200 || $height >= 200) {
    // record images whose width or height is at least 200;
    // LOCK_EX keeps concurrent requests from interleaving their writes
    file_put_contents('tenorcall.txt', $_GET['v'] . PHP_EOL, FILE_APPEND | LOCK_EX);
}


/**
 * Function: tenor
 * Purpose:  start a URL without waiting for the response
 * Param:    $page, the page script to run
 * Returns:  nothing
 **/
if (! function_exists('tenor')):
function tenor($page) {
    $host = $_SERVER["HTTP_HOST"];
    $fp = fsockopen($host, 80, $errno, $errstr);
    if (!$fp) {
        echo "$errstr ($errno)<br>\n";
    } else {
        // HTTP header lines end with CRLF
        fputs($fp, "GET /$page HTTP/1.0\r\nHost: $host\r\n\r\n");
        fclose($fp);
    }
}
endif;


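The code is still the original code; it did not shrink, it grew.
But because it now runs concurrently, it is noticeably faster.

Note: the tenor function does not run reliably on some web servers (IIS 6, for example); the cause is unknown.

I think having the client load the images is feasible.

The client then submits the information for the qualifying images to the server, and the server validates it once more before saving.


Also, where does 32768 come from? Wouldn't bytes 1-200 be enough?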
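
On the byte-range question, a small sketch may help. For PNG and GIF the dimensions sit at fixed offsets within the first 24 bytes, so a very short range would do. JPEG is different: its dimensions live in a start-of-frame segment that can follow metadata such as EXIF, which is one plausible reason to request a larger range like 32768 (the original poster did not explain the number). The function name sizeFromHeader is hypothetical, and JPEG is deliberately left unhandled.

```php
<?php
// Hypothetical helper: read [width, height] from the leading bytes of a
// PNG or GIF without decoding the image. Returns null for other formats.
function sizeFromHeader(string $bytes): ?array
{
    // PNG: 8-byte signature, 4-byte chunk length, "IHDR", then
    // big-endian 32-bit width and height at offsets 16 and 20.
    if (strlen($bytes) >= 24 && substr($bytes, 0, 8) === "\x89PNG\r\n\x1a\n") {
        $d = unpack('Nwidth/Nheight', substr($bytes, 16, 8));
        return [$d['width'], $d['height']];
    }
    // GIF: "GIF87a" or "GIF89a", then little-endian 16-bit width and
    // height at offsets 6 and 8.
    $magic = substr($bytes, 0, 6);
    if (strlen($bytes) >= 10 && ($magic === 'GIF87a' || $magic === 'GIF89a')) {
        $d = unpack('vwidth/vheight', substr($bytes, 6, 4));
        return [$d['width'], $d['height']];
    }
    return null; // JPEG and other formats need segment scanning, not shown here
}
```

With a parser like this, the expensive imagecreatefromstring() decode is only needed as a fallback for formats the header check does not cover.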


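Learning! So you get the image URLs with PHP and then read the image header information directly?

Learned something; I may need this later.

This is deep.

I'll probably need this very soon...

The pin feature on Pinterest is a clever idea, and the technique is simple: the bookmark holds a snippet of JS, and clicking it appends a JS file to the current page's document. Writing that JS file is easy; it mainly iterates over document.getElementsByTagName('img').

Ah, it looks like the OP means a different feature. I misread.

to xuzuning: I'm using apache2, not IIS 6.
With 138 images fetched concurrently, does that use up 138 connections? Do I need to edit php.ini to raise the connection limit? And what about CPU and memory overhead? Thanks.

to dream1206, yiwusuo, amani11: I looked at their add feature again. After submitting a URL, it seems to return one image first (within 1-3 seconds) and then the rest of the image information (7-9 seconds later). It is probably what you described: PHP only collects all the image URLs and JS checks the image sizes, or possibly ajax sends them concurrently to a second PHP page that checks the width and height and returns the data.

Either way, concurrency is unavoidable. Between concurrency in JS and concurrency directly in PHP, which consumes fewer resources? Thanks.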

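With 138 images fetched concurrently, does that use up 138 connections?


Do I need to edit php.ini to raise the connection limit?
No. The connections are outbound; if anything needed changing, it would be on the remote side.

What about CPU and memory overhead?
That is hard to measure.

As for doing the check in JS: since they didn't post code, I can't test it.
I wrote two approaches myself, neither was satisfactory, so I gave up.

Between concurrency in JS and concurrency directly in PHP, which consumes fewer resources?
In terms of resources they are the same: both have to load each image completely.
The former consumes client resources, though, while the latter consumes server resources.
Also, I don't know browser internals well, so I can't say whether the loading is truly concurrent.

Thanks to xuzuning for the detailed explanation. Continuing the discussion: an expert on another forum also answered, so I'm reposting the code.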

require 'simple_html_dom.php';

$url = 'http://www.huffingtonpost.com';
$html = file_get_html($url);
$nodes = array();
$start = microtime(true); // float seconds, so the subtraction below is valid
$res = array();
if ($html->find('img')) {
    foreach ($html->find('img') as $element) {
        if (startsWith($element->src, "/")) {
            $element->src = $url . $element->src;
        }
        if (!startsWith($element->src, "http")) {
            $element->src = $url . "/" . $element->src;
        }
        $nodes[] = $element->src;
    }
}
echo "<pre>";
print_r(imageDownload($nodes, 200, 200));
echo "<h1>", microtime(true) - $start, "</h1>";

// Downloads all images concurrently and returns the local temp paths of those
// at least $maxWidth wide and $maxHeight tall (the limits act as minimums).
function imageDownload($nodes, $maxHeight = 0, $maxWidth = 0) {
    $mh = curl_multi_init();
    $curl_array = array();
    foreach ($nodes as $i => $url) {
        $curl_array[$i] = curl_init($url);
        curl_setopt($curl_array[$i], CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl_array[$i], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)');
        curl_setopt($curl_array[$i], CURLOPT_CONNECTTIMEOUT, 5);
        curl_setopt($curl_array[$i], CURLOPT_TIMEOUT, 15);
        curl_multi_add_handle($mh, $curl_array[$i]);
    }
    // Drive all transfers until none is still running.
    $running = NULL;
    do {
        usleep(10000);
        curl_multi_exec($mh, $running);
    } while ($running > 0);
    $res = array();
    foreach ($nodes as $i => $url) {
        $curlErrorCode = curl_errno($curl_array[$i]);
        if ($curlErrorCode === 0) {
            $info = curl_getinfo($curl_array[$i]);
            if ($info['content_type'] !== null) {
                $ext = getExtention($info['content_type']);
                $temp = "temp/img" . md5(mt_rand()) . $ext;
                touch($temp);
                $imageContent = curl_multi_getcontent($curl_array[$i]);
                file_put_contents($temp, $imageContent);
                if ($maxHeight == 0 || $maxWidth == 0) {
                    $res[] = $temp;
                } else {
                    // getimagesize() returns width at [0] and height at [1]
                    $size = getimagesize($temp);
                    if ($size !== false && $size[0] >= $maxWidth && $size[1] >= $maxHeight) {
                        $res[] = $temp;
                    } else {
                        unlink($temp);
                    }
                }
            }
        }
        curl_multi_remove_handle($mh, $curl_array[$i]);
        curl_close($curl_array[$i]);
    }
    curl_multi_close($mh);
    return $res;
}

// Maps a Content-Type value to a file extension.
function getExtention($type) {
    $type = strtolower($type);
    switch ($type) {
        case "image/gif":
            return ".gif";
        case "image/png":
            return ".png";
        case "image/jpeg":
            return ".jpg";
        default:
            return ".img";
    }
}

// Case-insensitive prefix test.
function startsWith($str, $prefix) {
    $temp = substr($str, 0, strlen($prefix));
    $temp = strtolower($temp);
    $prefix = strtolower($prefix);
    return ($temp == $prefix);
}
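
One detail in the reposted code: because the "/" test comes first, a protocol-relative src such as //host/path gets the site URL prepended and becomes http://www.huffingtonpost.com//host/path. A rough sketch of a more careful resolver follows, under the assumption that only the common cases matter; resolveImageUrl is a hypothetical name, and "../" segments are not normalized.

```php
<?php
// Hypothetical helper: turn an <img> src into an absolute URL relative to
// the page it was found on. Handles absolute, protocol-relative,
// root-relative and simple document-relative forms only.
function resolveImageUrl(string $base, string $src): string
{
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    // Already absolute (http:, https:, data:, ...): leave untouched.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
        return $src;
    }
    // Protocol-relative: reuse the page's scheme.
    if (strncmp($src, '//', 2) === 0) {
        return $scheme . ':' . $src;
    }
    // Root-relative: attach to the origin.
    if (strncmp($src, '/', 1) === 0) {
        return $origin . $src;
    }
    // Document-relative: attach to the directory of the page path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}
```

For example, resolveImageUrl('http://www.huffingtonpost.com', '/images/trans.gif') yields http://www.huffingtonpost.com/images/trans.gif.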


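Execution time: 4.8 seconds. But the line if(in_array($absUrl, $visited))continue; raises an error: Warning: in_array() expects parameter 2 to be array, null. Also, the final image addresses are local cache paths, not network addresses.

More testing and investigation to follow.

This code takes about 1.8 seconds here, not counting the file_get_html ( $url ) time.

$res [] = $url ;//$temp;
With this change the result holds the network address.

He saves each image to a local file and then gets its size with getimagesize.

He is presumably fetching concurrently through curl; I'm not familiar with that mechanism.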
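
For readers equally unfamiliar with it: the curl_multi_* functions let one PHP process drive many transfers at once. The sketch below combines that with the range request from the first post, so only the leading bytes of each image are transferred when the remote server honors ranges. The function name fetchImageHeads and the default limits are assumptions; it waits on curl_multi_select() rather than sleeping in a fixed loop.

```php
<?php
// Hypothetical helper: fetch the first $bytes of every URL concurrently.
// Returns an array of url => partial body for transfers that returned data.
function fetchImageHeads(array $urls, int $bytes = 32768): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_RANGE, '0-' . ($bytes - 1)); // servers may ignore this
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // Run all transfers; block in select() until there is socket activity
    // instead of polling on a timer.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $i => $ch) {
        $body = curl_multi_getcontent($ch);
        if (is_string($body) && $body !== '') {
            $bodies[$urls[$i]] = $body;
        }
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $bodies;
}
```

To cap the number of simultaneous outbound connections, the URL list could be split with array_chunk() and passed in batches, for example 20 at a time; that batch size is an arbitrary illustration, not a measured optimum.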

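But the line if(in_array($absUrl, $visited))continue; raises an error: Warning: in_array() expects parameter 2 to be array, null.

His code doesn't contain the line you say is failing.
It is probably file_get_html raising the error.
file_get_html reads the URL with file_get_contents, which has a fairly low success rate;
it often takes two or three refreshes before any data comes back.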
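
If the page fetch really does fail intermittently, the manual refreshing could be replaced by a small retry wrapper. This is only a sketch: fetchWithRetry, the attempt count and the delay are hypothetical, and it does nothing about the underlying cause of the failures.

```php
<?php
// Hypothetical helper: call $fetch up to $attempts times and return the
// first non-empty result, or false if every attempt fails.
function fetchWithRetry(callable $fetch, int $attempts = 3, int $delayMs = 200)
{
    for ($i = 1; $i <= $attempts; $i++) {
        $result = $fetch();
        if ($result !== false && $result !== null && $result !== '') {
            return $result;
        }
        if ($i < $attempts) {
            usleep($delayMs * 1000); // brief pause before the next attempt
        }
    }
    return false;
}

// Possible usage with simple_html_dom, which can parse a string via str_get_html():
// $raw  = fetchWithRetry(function () use ($url) { return @file_get_contents($url); });
// $html = $raw === false ? null : str_get_html($raw);
```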

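JS can read an image's header information and get its height directly.
That is at least 10 times faster than getting the height after the image has finished loading.
I remember seeing a post about this on a blog,
but I didn't bookmark it and can't find it right now. Frustrating.

I went looking again and finally found that post; have a look:
http://www.planeart.cn/?p=1121

Very useful!

Nice, PHP is powerful!!

Learned something!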

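I went looking again and finally found that post; have a look:
http://www.planeart.cn/?p=1121

Couldn't you put together some sample code?

I just signed up at http://pinterest.com. It does this by having the client load the images.
Click Add, choose Pin, and paste the URL http://www.huffingtonpost.com/
Chrome's Network panel shows one request:
    GET /pin/create/find_images/?url=http%253A%2F%2Fwww.huffingtonpost.com HTTP/1.1
The response is a JSON object: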

images: [http://s.huffpost.com/images/v/logos/v4/homepage.gif?v9,…]
0: "http://s.huffpost.com/images/v/logos/v4/homepage.gif?v9"
1: "http://s.huffpost.com/images/v/logos/v4/tagline.gif"
2: "http://s.huffpost.com/images/splash/t_mini-a.png"
3: "http://s.huffpost.com/images/splash/t_mini-a.png"
4: "http://s.huffpost.com/images/splash/t_mini-a.png"
5: "http://s.huffpost.com/images/splash/t_mini-a.png"
6: "http://s.huffpost.com/images/splash/t_mini-a.png"
7: "http://s.huffpost.com/images/splash/t_mini-a.png"
8: "http://s.huffpost.com/images/splash/t_mini-a.png"
9: "http://s.huffpost.com/images/splash/t_mini-a.png"
10: "http://s.huffpost.com/images/splash/t_mini-a.png"
11: "http://s.huffpost.com/images/splash/t_mini-a.png"
12: "http://s.huffpost.com/images/splash/t_mini-a.png"
13: "http://s.huffpost.com/images/splash/t_mini-a.png"
14: "http://s.huffpost.com/images/splash/t_mini-a.png"
15: "http://s.huffpost.com/images/splash/t_mini-a.png"
16: "http://s.huffpost.com/images/splash/t_mini-a.png"
17: "http://i.huffpost.com/gen/560770/thumbs/r-GSA-LAS-VEGAS-VIDEO-huge.jpg"
18: "http://s.huffpost.com/images/webslice12x12.png"
19: "http://s.huffpost.com/images/v/blog_column.png"
20: "http://s.huffpost.com/contributors/gary-hart/headshot.jpg"
21: "http://www.huffingtonpost.com/images/trans.gif"
22: "http://www.huffingtonpost.com/images/trans.gif"
23: "http://www.huffingtonpost.com/images/trans.gif"
24: "http://images.huffingtonpost.com/2012-04-06-campbellguitar.jpg"
25: "http://www.huffingtonpost.com/images/trans.gif"
26: "http://www.huffingtonpost.com/images/trans.gif"
27: "http://www.huffingtonpost.com/images/trans.gif"
28: "http://www.huffingtonpost.com/images/trans.gif"
29: "http://www.huffingtonpost.com/images/trans.gif"
30: "http://www.huffingtonpost.com/images/trans.gif"
31: "http://images.huffingtonpost.com/2012-04-06-Screenshot20120406at7.09.17PM.jpg"
32: "http://www.huffingtonpost.com/images/trans.gif"
33: "http://www.huffingtonpost.com/images/trans.gif"
34: "http://www.huffingtonpost.com/images/trans.gif"
35: "http://www.huffingtonpost.com/images/trans.gif"
36: "http://www.huffingtonpost.com/images/trans.gif"
37: "http://www.huffingtonpost.com/images/trans.gif"
38: "http://www.huffingtonpost.com/images/trans.gif"
39: "http://www.huffingtonpost.com/images/trans.gif"
40: "http://www.huffingtonpost.com/images/trans.gif"
41: "http://www.huffingtonpost.com/images/trans.gif"
42: "http://www.huffingtonpost.com/images/trans.gif"
43: "http://www.huffingtonpost.com/images/trans.gif"
44: "http://www.huffingtonpost.com/images/trans.gif"
45: "http://www.huffingtonpost.com/images/trans.gif"
46: "http://www.huffingtonpost.com/images/trans.gif"
47: "http://www.huffingtonpost.com/images/trans.gif"
48: "http://www.huffingtonpost.com/images/trans.gif"
49: "http://www.huffingtonpost.com/images/trans.gif"
50: "http://www.huffingtonpost.com/images/trans.gif"
51: "http://www.huffingtonpost.com/images/trans.gif"
52: "http://www.huffingtonpost.com/images/trans.gif"
53: "http://www.huffingtonpost.com/images/trans.gif"
54: "http://www.huffingtonpost.com/images/trans.gif"
55: "http://www.huffingtonpost.com/images/trans.gif"
56: "http://www.huffingtonpost.com/images/trans.gif"
57: "http://www.huffingtonpost.com/images/trans.gif"
58: "http://www.huffingtonpost.com/images/trans.gif"
59: "http://www.huffingtonpost.com/images/trans.gif"
60: "http://www.huffingtonpost.com/images/trans.gif"
61: "http://www.huffingtonpost.com/images/trans.gif"
62: "http://www.huffingtonpost.com/images/trans.gif"
63: "http://www.huffingtonpost.com/images/trans.gif"
64: "http://www.huffingtonpost.com/images/trans.gif"
65: "http://www.huffingtonpost.com/images/trans.gif"
66: "http://www.huffingtonpost.com/images/trans.gif"
67: "http://www.huffingtonpost.com/images/trans.gif"
68: "http://www.huffingtonpost.com/images/trans.gif"
69: "http://www.huffingtonpost.com/images/trans.gif"
70: "http://www.huffingtonpost.com/images/trans.gif"
71: "http://www.huffingtonpost.com/images/trans.gif"
72: "http://www.huffingtonpost.com/images/trans.gif"
73: "http://www.huffingtonpost.com/images/trans.gif"
74: "http://www.huffingtonpost.com/images/trans.gif"
75: "http://s.huffpost.com/images/blank.gif"
76: "http://s.huffpost.com/images/blank.gif"
77: "http://s.huffpost.com/images/blank.gif"
78: "http://s.huffpost.com/images/blank.gif"
79: "http://s.huffpost.com/images/blank.gif"
80: "http://s.huffpost.com/images/blank.gif"
81: "http://s.huffpost.com/images/blank.gif"
82: "http://s.huffpost.com/images/facebook_promo_connect.png?3"
83: "http://s.huffpost.com/images/loader.gif"
84: "http://www.huffingtonpost.com/images/trans.gif"
85: "http://www.huffingtonpost.com/images/trans.gif"
86: "http://www.huffingtonpost.com/images/trans.gif"
87: "http://www.huffingtonpost.com/images/trans.gif"
88: "http://www.huffingtonpost.com/images/trans.gif"
89: "http://www.huffingtonpost.com/images/trans.gif"
90: "http://s.huffpost.com/contributors/gary-hart/headshot.jpg"
91: "http://s.huffpost.com/contributors/mike-campbell/headshot.jpg"
92: "http://s.huffpost.com/contributors/roma-downey/headshot.jpg"
93: "http://s.huffpost.com/contributors/gavin-newsom/headshot.jpg"
94: "http://s.huffpost.com/contributors/sarah-shourd/headshot.jpg"
95: "http://s.huffpost.com/contributors/jacqueline-novogratz/headshot.jpg"
96: "http://s.huffpost.com/contributors/peggy-drexler/headshot.jpg"
97: "http://s.huffpost.com/contributors/mohamed-a-elerian/headshot.jpg"
98: "http://s.huffpost.com/contributors/bill-mckibben/headshot.jpg"
99: "http://s.huffpost.com/contributors/marlo-thomas/headshot.jpg"
100: "http://www.huffingtonpost.com/images/v/something_to_say_button.png"
101: "http://www.huffingtonpost.com/images/trans.gif"
102: "http://www.huffingtonpost.com/images/trans.gif"
103: "http://www.huffingtonpost.com/images/trans.gif"
104: "http://www.huffingtonpost.com/images/trans.gif"
105: "http://www.huffingtonpost.com/images/trans.gif"
106: "http://www.huffingtonpost.com/images/trans.gif"
107: "http://www.huffingtonpost.com/images/trans.gif"
108: "http://www.huffingtonpost.com/images/trans.gif"
109: "http://www.huffingtonpost.com/images/trans.gif"
110: "http://www.huffingtonpost.com/images/trans.gif"
111: "http://www.huffingtonpost.com/images/trans.gif"
112: "http://www.huffingtonpost.com/images/trans.gif"
113: "http://www.huffingtonpost.com/images/trans.gif"
114: "http://www.huffingtonpost.com/images/trans.gif"
115: "http://www.huffingtonpost.com/images/trans.gif"
116: "http://www.huffingtonpost.com/images/trans.gif"
117: "http://www.huffingtonpost.com/images/trans.gif"
118: "http://www.huffingtonpost.com/images/trans.gif"
119: "http://www.huffingtonpost.com/images/trans.gif"
120: "http://www.huffingtonpost.com/images/trans.gif"
121: "http://www.huffingtonpost.com/images/trans.gif"
122: "http://www.huffingtonpost.com/images/trans.gif"
123: "http://www.huffingtonpost.com/images/trans.gif"
124: "http://www.huffingtonpost.com/images/trans.gif"
125: "http://www.huffingtonpost.com/images/trans.gif"
126: "http://www.huffingtonpost.com/images/trans.gif"
127: "http://www.huffingtonpost.com/images/trans.gif"
128: "http://www.huffingtonpost.com/images/trans.gif"
129: "http://www.huffingtonpost.com/images/trans.gif"
130: "http://www.huffingtonpost.com/images/trans.gif"
131: "http://www.huffingtonpost.com/images/trans.gif"
132: "http://www.huffingtonpost.com/images/trans.gif"
133: "http://www.huffingtonpost.com/images/trans.gif"
134: "http://b.scorecardresearch.com/p?c1=2&c2=6723616&c3=&c4=&c5=front&c6=&c15=&cj=1"
135: "http://www.huffingtonpost.com//secure-us.imrworldwide.com/cgi-bin/m?ci=us-703240h&cg=0&cc=1&ts=noscript"
136: "http://vertical-stats.huffpost.com/?-1&&"
137: "http://www.huffingtonpost.com//pixel.quantserve.com/pixel/p-6fTutip1SMLM2.gif?labels=Home"
images_count: 138
redirected: false
status: "success"
title: "Breaking News and Opinion on The Huffington Post"
type: "text/html; charset=utf-8"


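Almost as soon as the server responds, the browser starts loading the images. In Chrome's monitoring, the yellow bar is the request that submits the URL and gets the image list; everything after it is image loading, whose speed depends on my own network connection.

Because the JS on http://pinterest.com/ is minified and uses jQuery, tracing it is a real chore. What it actually does is simple, though, and anyone could come up with it: iterate over the JSON data, create img element objects, set their src attribute, and keep the objects. The browser does the rest on its own.

Bandwidth is a concern too.

I just signed up at http://pinterest.com. It does this by having the client load the images.
Click Add, choose Pin, and paste the URL http://www.huffingtonpost.com/
Chrome's Network panel shows one request:
   GET /pin/create/find_images/?url=http%253A%2F%2Fwww.huffingtonpost.com HTTP/1.……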



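Where are these objects kept? In a cookie, or in history files on the server? Also, how does jQuery get the image widths and heights with multiple threads?

Quoting the reply from floor 38:

I just signed up at http://pinterest.com. It does this by having the client load the images.
Click Add, choose Pin, and paste the URL http://www.huffingtonpost.com/
Chrome's Network panel shows one request:
GET /pin/create/find_images/?url=http%253A%2F%2Fwww.huffingtonpo……


What objects?
Do you mean the image link data returned by the server? There's no need to save it. Once the ajax response arrives, you parse the returned data and that's it.
Also, browsers load external resources asynchronously. In other words, with or without jQuery, the images load asynchronously and don't affect one another, much like the PHP-side version our resident expert wrote.

Quoting the reply from floor 29:

I went looking again and finally found that post; have a look:
http://www.planeart.cn/?p=1121
Couldn't you put together some sample code?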


http://www.planeart.cn/demo/imgReady/

Hopefully I'll be able to understand this someday.

Thanks for the effort, OP; showing my support.



+1+1+1+1+1+1