The fastest way to get the width and height of all images on a web page
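Has anyone here tried http://pinterest.com? After you register, it offers an "Add a Pin" feature: you submit a website URL, click Find Images, and it finds all the images on that page (and filters them by width and height). The whole process usually takes about 10 seconds.
I recently wanted to imitate it as a small component. I have already dropped the dreadful getimagesize() (48.64 seconds) and switched to imagecreatefromstring() (still 26.13 seconds). Compared with its roughly 10 seconds, that is a world of difference.
I need to keep the number of TCP connections in mind, use as few server resources as possible, and keep execution time as short as possible. Asking the experts here: how can I optimize the code further so that it runs faster?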
// curl setup: request only the first 32 KB of each image
function ranger($url){
    $headers = array("Range: bytes=0-32768");
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($curl);
    curl_close($curl); // close the handle before returning
    return $data;
}

// simple_html_dom.php is used to parse the HTML nodes
require dirname(__FILE__) . '/simple_html_dom.php';

$url = 'http://www.huffingtonpost.com/';
$html = file_get_html($url);
if($html->find('img')){
    foreach($html->find('img') as $element) {
        $raw = ranger($element->src);
        if(!$raw) continue; // request failed or returned nothing
        $im = @imagecreatefromstring($raw);
        if(!$im) continue; // not a decodable image
        $width = imagesx($im);
        $height = imagesy($im);
        if($width>=200||$height>=200){
            echo $element; // output images whose width or height is at least 200
        }
    }
}
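Perhaps you could take a detour and relieve the network load on the server.
The server is responsible for parsing the HTML, collecting the information from the img tags, and sending the collected text data back to the client.
Loading the images is left to the client, which only needs to read the width and height properties to get each image's original size.
There are plenty of benefits, though hotlink protection may get in the way.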
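A minimal sketch of the server half of this suggestion, assuming PHP's bundled DOM extension rather than simple_html_dom.php. The helper name collectImageUrls and the choice to drop duplicate URLs are assumptions for illustration, not something from the thread.

```php
<?php
/**
 * Collect the src attribute of every <img> tag in an HTML string.
 * Duplicates and empty values are dropped; the URLs are returned as written
 * in the page (relative URLs are not resolved here).
 */
function collectImageUrls(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world pages are rarely valid markup, so keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    return array_values(array_unique($urls));
}

// Possible usage: send the list to the browser, which then loads each image
// and reads its dimensions itself.
// $urls = collectImageUrls(file_get_contents('http://www.huffingtonpost.com/'));
// echo json_encode(['images' => $urls, 'images_count' => count($urls)]);
```

With this split the server makes a single outbound request per submitted page, and the per-image traffic moves to the visitor's browser.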
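Fetch and parse: 2.8 seconds
Fetch images (138): 27 seconds
Found: 7
Optimizing the code alone probably won't gain much.
Consider running the requests concurrently.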
Fetch and parse: 3.6 seconds
Launch the image-reading processes (138): 1.3 seconds
Records in the result file: 7
http://s.huffpost.com/images/v/logos/v4/tagline.gif
http://s.huffpost.com/images/v/logos/v4/homepage.gif?v9
http://i.huffpost.com/gen/559399/thumbs/r-OLBERMANN-huge.jpg
http://s.huffpost.com/images/facebook_promo_connect.png?3
http://images.huffingtonpost.com/2012-04-04-michaeljfoxmarlo2SECOND.jpg
http://images.huffingtonpost.com/2012-04-05-Screenshot20120405at9.40.24AM.jpg
http://i.huffpost.com/gen/557914/thumbs/s-SCORSESE-large300.jpg
// in the main script: fire one background request per image
foreach($html->find('img') as $element) {
    tenor("tenorcall.php?v=" . urlencode($element->src));
}
// tenorcall.php
// curl setup: request only the first 32 KB of the image
function ranger($url){
    $headers = array("Range: bytes=0-32768");
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($curl);
    curl_close($curl); // close the handle before returning
    return $data;
}

$raw = ranger($_GET['v']);
$im = $raw ? @imagecreatefromstring($raw) : false;
if($im){
    $width = imagesx($im);
    $height = imagesy($im);
    if($width>=200||$height>=200){
        // record images whose width or height is at least 200
        file_put_contents('tenorcall.txt', $_GET['v'].PHP_EOL, FILE_APPEND);
    }
}
/**
 * Function: tenor
 * Purpose:  start a URL request without waiting for the response
 * Param:    $page, the page script to run
 * Returns:  nothing
 **/
if(! function_exists('tenor')):
function tenor($page) {
    $host = $_SERVER["HTTP_HOST"];
    $fp = fsockopen($host, 80, $errno, $errstr);
    if(!$fp) {
        echo "$errstr ($errno)<br>\n";
    } else {
        fputs($fp, "GET /$page HTTP/1.0\r\nHost: $host\r\n\r\n");
        fclose($fp);
    }
}
endif;
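I think the approach of letting the client load the images is workable.
The client then submits the information for the qualifying images to the server, and the server verifies it once more before saving...
Also, where does 32768 come from? Wouldn't 1-200 be enough?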
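On the question of how many bytes are needed: PNG and GIF keep their dimensions in the first 24 and 10 bytes, but a JPEG stores them in a start-of-frame segment that can sit behind metadata segments of varying length, so a 200-byte prefix is not always enough. That is one plausible reason for a generous range such as 32768, though the thread does not say why that number was chosen. The sketch below reads the size straight from a partial buffer without decoding the image; probeImageSize is a hypothetical helper that assumes well-formed files and covers only PNG, GIF and JPEG.

```php
<?php
/**
 * Read width and height from the leading bytes of an image.
 * Returns ['width' => int, 'height' => int], or null when the format is not
 * recognised or the buffer ends before the size information.
 */
function probeImageSize(string $data): ?array
{
    $len = strlen($data);

    // PNG: 8-byte signature, then the IHDR chunk; width at offset 16, height at 20.
    if ($len >= 24 && substr($data, 0, 8) === "\x89PNG\r\n\x1a\n") {
        return unpack('Nwidth/Nheight', substr($data, 16, 8));
    }

    // GIF: 6-byte signature, then little-endian width and height.
    $sig = substr($data, 0, 6);
    if ($len >= 10 && ($sig === 'GIF87a' || $sig === 'GIF89a')) {
        return unpack('vwidth/vheight', substr($data, 6, 4));
    }

    // JPEG: walk the segments until a start-of-frame (SOF) marker appears.
    if ($len >= 4 && substr($data, 0, 2) === "\xFF\xD8") {
        $pos = 2;
        while ($pos + 9 <= $len) {
            if ($data[$pos] !== "\xFF") {
                return null; // not positioned on a marker, give up
            }
            $marker = ord($data[$pos + 1]);
            if ($marker === 0xFF) { // fill byte before a marker
                $pos++;
                continue;
            }
            // SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC).
            $isSof = $marker >= 0xC0 && $marker <= 0xCF
                && !in_array($marker, [0xC4, 0xC8, 0xCC], true);
            if ($isSof) {
                // Layout: FF Cx, length(2), precision(1), height(2), width(2).
                $d = unpack('nheight/nwidth', substr($data, $pos + 5, 4));
                return ['width' => $d['width'], 'height' => $d['height']];
            }
            // Skip this segment: 2 marker bytes plus its declared length.
            $segLen = unpack('n', substr($data, $pos + 2, 2))[1];
            $pos += 2 + $segLen;
        }
    }
    return null;
}
```

A null result for a JPEG prefix means the frame header lies beyond the bytes fetched so far, in which case a larger range would have to be requested.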
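Learning! So it uses PHP to get the image URLs and then reads the image header information directly?
Learned something; I may need this later.
This is deep.
I'll probably need this very soon...
The pin feature on pinterest is a nice idea, and technically it is very simple: the bookmark holds a snippet of JS code, and clicking the bookmark appends a JS file to the current page's document. Writing that JS file is easy; it mainly iterates over document.getElementsByTagName('img').
Ah, it seems the OP is talking about a different feature. I misread.
to xuzuning: I'm using apache2, not iis6.
With 138 images fetched concurrently, does that use up 138 connections? Do I need to edit php.ini to raise the connection limit? And what about CPU and memory overhead? Thanks.
to dream1206, yiwusuo, amani11: I took another look at its Add flow. It seems that after the URL is submitted, one image comes back almost immediately (within 1-3 seconds), and the rest of the image information follows later (after 7-9 seconds). It is probably what you described: PHP only collects all the image URLs and JS checks the image sizes, or the client even sends concurrent ajax requests to a second PHP page that checks the width and height and returns the data.
Either way, concurrency is unavoidable. Comparing JS concurrency with doing the concurrency directly in PHP, which one consumes fewer resources? Thanks.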
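With 138 images fetched concurrently, does that use up 138 connections?
Yes.
Do I need to edit php.ini to raise the connection limit?
No. The connections are outbound; if anything needs changing, it is on the other side.
What about CPU and memory overhead?
That is hard to test.
Also, on using JS to check the sizes: since they gave no code, I couldn't test it.
I wrote two versions myself, neither was satisfactory, so I gave up.
Comparing JS concurrency with doing the concurrency directly in PHP, which one consumes fewer resources?
In terms of resources they are the same: both have to load the images completely.
The difference is that the former uses client resources and the latter uses server resources.
I also don't know the browser's internals well, so I can't say whether its requests are truly concurrent.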
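If holding 138 outbound connections open at once is a concern, one option is to cap how many run at a time. The following is only a sketch under assumptions: it uses PHP's curl_multi functions, fetchImageHeads and its default batch size of 10 are hypothetical, and some servers ignore the Range request and send the whole file with status 200.

```php
<?php
/**
 * Fetch the first $bytes bytes of each URL, at most $batchSize requests at a time.
 * Returns url => body for the requests that answered with HTTP 200 or 206.
 */
function fetchImageHeads(array $urls, int $batchSize = 10, int $bytes = 32768): array
{
    $heads = [];
    $unique = array_values(array_unique($urls));

    foreach (array_chunk($unique, max(1, $batchSize)) as $batch) {
        $mh = curl_multi_init();
        $handles = [];
        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_RANGE          => '0-' . ($bytes - 1), // partial download
                CURLOPT_CONNECTTIMEOUT => 5,
                CURLOPT_TIMEOUT        => 15,
            ]);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }

        // Drive all transfers in this batch; curl_multi_select waits for
        // socket activity instead of spinning.
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running > 0) {
                curl_multi_select($mh, 1.0);
            }
        } while ($running > 0 && $status === CURLM_OK);

        foreach ($handles as $url => $ch) {
            $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            if ($code === 200 || $code === 206) {
                $heads[$url] = (string) curl_multi_getcontent($ch);
            }
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $heads;
}
```

A smaller batch size trades total run time for fewer simultaneous connections; each batch waits for its slowest request before the next one starts.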
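Thanks to xuzuning for the detailed explanation. Let's keep the discussion going. An expert on another forum also answered, so I'm reposting the code.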
require 'simple_html_dom.php';

$url = 'http://www.huffingtonpost.com';
$html = file_get_html($url);
$nodes = array();
$start = microtime(true);
$res = array();

if ($html->find('img')) {
    foreach ($html->find('img') as $element) {
        if (startsWith($element->src, "/")) {
            $element->src = $url . $element->src;
        }
        if (!startsWith($element->src, "http")) {
            $element->src = $url . "/" . $element->src;
        }
        $nodes[] = $element->src;
    }
}

echo "<pre>";
print_r(imageDownload($nodes, 200, 200));
echo "<h1>", microtime(true) - $start, "</h1>";

// Download every image concurrently. Despite the "max" in their names, the
// two limits act as minimums: only images at least that tall and wide are kept.
function imageDownload($nodes, $maxHeight = 0, $maxWidth = 0) {
    $mh = curl_multi_init();
    $curl_array = array();
    foreach ($nodes as $i => $url) {
        $curl_array[$i] = curl_init($url);
        curl_setopt($curl_array[$i], CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl_array[$i], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)');
        curl_setopt($curl_array[$i], CURLOPT_CONNECTTIMEOUT, 5);
        curl_setopt($curl_array[$i], CURLOPT_TIMEOUT, 15);
        curl_multi_add_handle($mh, $curl_array[$i]);
    }

    // run all transfers until none is still active
    $running = NULL;
    do {
        usleep(10000);
        curl_multi_exec($mh, $running);
    } while ($running > 0);

    $res = array();
    foreach ($nodes as $i => $url) {
        $curlErrorCode = curl_errno($curl_array[$i]);
        if ($curlErrorCode === 0) {
            $info = curl_getinfo($curl_array[$i]);
            $ext = getExtention($info['content_type']);
            if ($info['content_type'] !== null) {
                // save to a local temp file, then measure it
                $temp = "temp/img" . md5(mt_rand()) . $ext;
                touch($temp);
                $imageContent = curl_multi_getcontent($curl_array[$i]);
                file_put_contents($temp, $imageContent);
                if ($maxHeight == 0 || $maxWidth == 0) {
                    $res[] = $temp;
                } else {
                    $size = getimagesize($temp); // [0] = width, [1] = height
                    if ($size[1] >= $maxHeight && $size[0] >= $maxWidth) {
                        $res[] = $temp;
                    } else {
                        unlink($temp);
                    }
                }
            }
        }
        curl_multi_remove_handle($mh, $curl_array[$i]);
        curl_close($curl_array[$i]);
    }
    curl_multi_close($mh);
    return $res;
}

// map a Content-Type to a file extension
function getExtention($type) {
    $type = strtolower($type);
    switch ($type) {
        case "image/gif":
            return ".gif";
        case "image/png":
            return ".png";
        case "image/jpeg":
            return ".jpg";
        default:
            return ".img";
    }
}

// case-insensitive prefix test
function startsWith($str, $prefix) {
    $temp = substr($str, 0, strlen($prefix));
    $temp = strtolower($temp);
    $prefix = strtolower($prefix);
    return ($temp == $prefix);
}
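One thing the prefix checks in the posted code do not cover is a protocol-relative source such as //host/path: it starts with "/", so the page URL gets prepended to it. A hedged sketch of a slightly more careful resolver follows; resolveImageUrl is a hypothetical helper, and it deliberately ignores "../" segments, <base> tags and data: URIs.

```php
<?php
/**
 * Turn an <img> src into an absolute URL, given the URL of the page it came from.
 */
function resolveImageUrl(string $src, string $pageUrl): string
{
    $src = trim($src);
    $base = parse_url($pageUrl);
    $scheme = $base['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($base['host'] ?? '')
        . (isset($base['port']) ? ':' . $base['port'] : '');

    if (preg_match('#^https?://#i', $src)) {
        return $src;                    // already absolute
    }
    if (strncmp($src, '//', 2) === 0) {
        return $scheme . ':' . $src;    // protocol-relative: reuse the page's scheme
    }
    if (strncmp($src, '/', 1) === 0) {
        return $origin . $src;          // relative to the site root
    }
    // Otherwise relative to the directory of the page.
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}

// Possible usage inside the loop that builds $nodes:
// $nodes[] = resolveImageUrl($element->src, $url);
```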
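This code takes about 1.8 seconds here, not counting the time for file_get_html($url).
$res[] = $url; //$temp;
With this change the result holds the remote URLs.
He saves each image to a local file and then uses getimagesize to get the dimensions.
He is presumably doing the concurrency through curl; I don't know that mechanism well.
But the line if(in_array($absUrl, $visited))continue; raises an error. Warning: in_array() expects parameter 2 to be array, null.
His code doesn't contain the line you say is failing.
It is probably file_get_html that raises the error.
file_get_html reads the URL with file_get_contents, and its success rate is fairly low.
It often takes two or three refreshes before any data comes back.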
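JS can read the image's header information and get the image's height directly.
That is at least 10 times faster than waiting for the image to finish loading and then reading its height.
I remember seeing a post about this on a blog,
but I didn't bookmark it and can't find it right now. Frustrating~
I just went looking again and finally found that post. You can study it~
http://www.planeart.cn/?p=1121
Very useful!
Nice, PHP is powerful!!
Learned something!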
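I just registered at http://pinterest.com. Its approach is to let the client do the loading.
Click Add, choose Pin, and paste in the URL http://www.huffingtonpost.com/
In Chrome's Network panel you can see this request:
GET /pin/create/find_images/?url=http%253A%2F%2Fwww.huffingtonpost.com HTTP/1.1
The response is a JSON object: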
images: [http://s.huffpost.com/images/v/logos/v4/homepage.gif?v9,…]
0: "http://s.huffpost.com/images/v/logos/v4/homepage.gif?v9"
1: "http://s.huffpost.com/images/v/logos/v4/tagline.gif"
2: "http://s.huffpost.com/images/splash/t_mini-a.png"
3: "http://s.huffpost.com/images/splash/t_mini-a.png"
4: "http://s.huffpost.com/images/splash/t_mini-a.png"
5: "http://s.huffpost.com/images/splash/t_mini-a.png"
6: "http://s.huffpost.com/images/splash/t_mini-a.png"
7: "http://s.huffpost.com/images/splash/t_mini-a.png"
8: "http://s.huffpost.com/images/splash/t_mini-a.png"
9: "http://s.huffpost.com/images/splash/t_mini-a.png"
10: "http://s.huffpost.com/images/splash/t_mini-a.png"
11: "http://s.huffpost.com/images/splash/t_mini-a.png"
12: "http://s.huffpost.com/images/splash/t_mini-a.png"
13: "http://s.huffpost.com/images/splash/t_mini-a.png"
14: "http://s.huffpost.com/images/splash/t_mini-a.png"
15: "http://s.huffpost.com/images/splash/t_mini-a.png"
16: "http://s.huffpost.com/images/splash/t_mini-a.png"
17: "http://i.huffpost.com/gen/560770/thumbs/r-GSA-LAS-VEGAS-VIDEO-huge.jpg"
18: "http://s.huffpost.com/images/webslice12x12.png"
19: "http://s.huffpost.com/images/v/blog_column.png"
20: "http://s.huffpost.com/contributors/gary-hart/headshot.jpg"
21: "http://www.huffingtonpost.com/images/trans.gif"
22: "http://www.huffingtonpost.com/images/trans.gif"
23: "http://www.huffingtonpost.com/images/trans.gif"
24: "http://images.huffingtonpost.com/2012-04-06-campbellguitar.jpg"
25: "http://www.huffingtonpost.com/images/trans.gif"
26: "http://www.huffingtonpost.com/images/trans.gif"
27: "http://www.huffingtonpost.com/images/trans.gif"
28: "http://www.huffingtonpost.com/images/trans.gif"
29: "http://www.huffingtonpost.com/images/trans.gif"
30: "http://www.huffingtonpost.com/images/trans.gif"
31: "http://images.huffingtonpost.com/2012-04-06-Screenshot20120406at7.09.17PM.jpg"
32: "http://www.huffingtonpost.com/images/trans.gif"
33: "http://www.huffingtonpost.com/images/trans.gif"
34: "http://www.huffingtonpost.com/images/trans.gif"
35: "http://www.huffingtonpost.com/images/trans.gif"
36: "http://www.huffingtonpost.com/images/trans.gif"
37: "http://www.huffingtonpost.com/images/trans.gif"
38: "http://www.huffingtonpost.com/images/trans.gif"
39: "http://www.huffingtonpost.com/images/trans.gif"
40: "http://www.huffingtonpost.com/images/trans.gif"
41: "http://www.huffingtonpost.com/images/trans.gif"
42: "http://www.huffingtonpost.com/images/trans.gif"
43: "http://www.huffingtonpost.com/images/trans.gif"
44: "http://www.huffingtonpost.com/images/trans.gif"
45: "http://www.huffingtonpost.com/images/trans.gif"
46: "http://www.huffingtonpost.com/images/trans.gif"
47: "http://www.huffingtonpost.com/images/trans.gif"
48: "http://www.huffingtonpost.com/images/trans.gif"
49: "http://www.huffingtonpost.com/images/trans.gif"
50: "http://www.huffingtonpost.com/images/trans.gif"
51: "http://www.huffingtonpost.com/images/trans.gif"
52: "http://www.huffingtonpost.com/images/trans.gif"
53: "http://www.huffingtonpost.com/images/trans.gif"
54: "http://www.huffingtonpost.com/images/trans.gif"
55: "http://www.huffingtonpost.com/images/trans.gif"
56: "http://www.huffingtonpost.com/images/trans.gif"
57: "http://www.huffingtonpost.com/images/trans.gif"
58: "http://www.huffingtonpost.com/images/trans.gif"
59: "http://www.huffingtonpost.com/images/trans.gif"
60: "http://www.huffingtonpost.com/images/trans.gif"
61: "http://www.huffingtonpost.com/images/trans.gif"
62: "http://www.huffingtonpost.com/images/trans.gif"
63: "http://www.huffingtonpost.com/images/trans.gif"
64: "http://www.huffingtonpost.com/images/trans.gif"
65: "http://www.huffingtonpost.com/images/trans.gif"
66: "http://www.huffingtonpost.com/images/trans.gif"
67: "http://www.huffingtonpost.com/images/trans.gif"
68: "http://www.huffingtonpost.com/images/trans.gif"
69: "http://www.huffingtonpost.com/images/trans.gif"
70: "http://www.huffingtonpost.com/images/trans.gif"
71: "http://www.huffingtonpost.com/images/trans.gif"
72: "http://www.huffingtonpost.com/images/trans.gif"
73: "http://www.huffingtonpost.com/images/trans.gif"
74: "http://www.huffingtonpost.com/images/trans.gif"
75: "http://s.huffpost.com/images/blank.gif"
76: "http://s.huffpost.com/images/blank.gif"
77: "http://s.huffpost.com/images/blank.gif"
78: "http://s.huffpost.com/images/blank.gif"
79: "http://s.huffpost.com/images/blank.gif"
80: "http://s.huffpost.com/images/blank.gif"
81: "http://s.huffpost.com/images/blank.gif"
82: "http://s.huffpost.com/images/facebook_promo_connect.png?3"
83: "http://s.huffpost.com/images/loader.gif"
84: "http://www.huffingtonpost.com/images/trans.gif"
85: "http://www.huffingtonpost.com/images/trans.gif"
86: "http://www.huffingtonpost.com/images/trans.gif"
87: "http://www.huffingtonpost.com/images/trans.gif"
88: "http://www.huffingtonpost.com/images/trans.gif"
89: "http://www.huffingtonpost.com/images/trans.gif"
90: "http://s.huffpost.com/contributors/gary-hart/headshot.jpg"
91: "http://s.huffpost.com/contributors/mike-campbell/headshot.jpg"
92: "http://s.huffpost.com/contributors/roma-downey/headshot.jpg"
93: "http://s.huffpost.com/contributors/gavin-newsom/headshot.jpg"
94: "http://s.huffpost.com/contributors/sarah-shourd/headshot.jpg"
95: "http://s.huffpost.com/contributors/jacqueline-novogratz/headshot.jpg"
96: "http://s.huffpost.com/contributors/peggy-drexler/headshot.jpg"
97: "http://s.huffpost.com/contributors/mohamed-a-elerian/headshot.jpg"
98: "http://s.huffpost.com/contributors/bill-mckibben/headshot.jpg"
99: "http://s.huffpost.com/contributors/marlo-thomas/headshot.jpg"
100: "http://www.huffingtonpost.com/images/v/something_to_say_button.png"
101: "http://www.huffingtonpost.com/images/trans.gif"
102: "http://www.huffingtonpost.com/images/trans.gif"
103: "http://www.huffingtonpost.com/images/trans.gif"
104: "http://www.huffingtonpost.com/images/trans.gif"
105: "http://www.huffingtonpost.com/images/trans.gif"
106: "http://www.huffingtonpost.com/images/trans.gif"
107: "http://www.huffingtonpost.com/images/trans.gif"
108: "http://www.huffingtonpost.com/images/trans.gif"
109: "http://www.huffingtonpost.com/images/trans.gif"
110: "http://www.huffingtonpost.com/images/trans.gif"
111: "http://www.huffingtonpost.com/images/trans.gif"
112: "http://www.huffingtonpost.com/images/trans.gif"
113: "http://www.huffingtonpost.com/images/trans.gif"
114: "http://www.huffingtonpost.com/images/trans.gif"
115: "http://www.huffingtonpost.com/images/trans.gif"
116: "http://www.huffingtonpost.com/images/trans.gif"
117: "http://www.huffingtonpost.com/images/trans.gif"
118: "http://www.huffingtonpost.com/images/trans.gif"
119: "http://www.huffingtonpost.com/images/trans.gif"
120: "http://www.huffingtonpost.com/images/trans.gif"
121: "http://www.huffingtonpost.com/images/trans.gif"
122: "http://www.huffingtonpost.com/images/trans.gif"
123: "http://www.huffingtonpost.com/images/trans.gif"
124: "http://www.huffingtonpost.com/images/trans.gif"
125: "http://www.huffingtonpost.com/images/trans.gif"
126: "http://www.huffingtonpost.com/images/trans.gif"
127: "http://www.huffingtonpost.com/images/trans.gif"
128: "http://www.huffingtonpost.com/images/trans.gif"
129: "http://www.huffingtonpost.com/images/trans.gif"
130: "http://www.huffingtonpost.com/images/trans.gif"
131: "http://www.huffingtonpost.com/images/trans.gif"
132: "http://www.huffingtonpost.com/images/trans.gif"
133: "http://www.huffingtonpost.com/images/trans.gif"
134: "http://b.scorecardresearch.com/p?c1=2&c2=6723616&c3=&c4=&c5=front&c6=&c15=&cj=1"
135: "http://www.huffingtonpost.com//secure-us.imrworldwide.com/cgi-bin/m?ci=us-703240h&cg=0&cc=1&ts=noscript"
136: "http://vertical-stats.huffpost.com/?-1&&"
137: "http://www.huffingtonpost.com//pixel.quantserve.com/pixel/p-6fTutip1SMLM2.gif?labels=Home"
images_count: 138
redirected: false
status: "success"
title: "Breaking News and Opinion on The Huffington Post"
type: "text/html; charset=utf-8"
Bandwidth is also an issue.
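Quoting the reply on floor 29:
I just went looking again and finally found that post. You can study it~
http://www.planeart.cn/?p=1121
Couldn't you put together some sample code?
Hopefully I'll be able to understand this someday.
Thanks for the hard work, OP. Showing my support.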