
Question about multi-threaded scraping with streams: experts, please help analyze

Posted by WBOY (original), 2016-06-13 13:09:17, 893 views

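Multi-threaded scraping with streams: could the experts take a look?
Scraping content is slow, and it has been giving me a headache. I have recently been looking into multi-threaded scraping, and the comparison code is posted below. I have two questions: first, the lengths of the results from the two approaches do not quite match; second, is this still not efficient enough? Please help analyze and test!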

PHP code
<?php
$timeStart = microtimeFloat();
// Current Unix time in seconds as a float, with microsecond precision.
function microtimeFloat() {
    return microtime(true);
}
$data = '';
$urls = array(
    'http://www.tzksgs.com/news/2012-09/article-217.html',
    'http://www.tzksgs.com/news/2012-09/article-219.html',
    'http://www.tzksgs.com/news/2012-09/article-222.html',
);
// Baseline: fetch the pages one at a time with the blocking file_get_contents().
foreach($urls as $url){
    echo strlen(file_get_contents($url)),'<br>';
}
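$timeEnd = microtimeFloat();
echo sprintf("Time spent: %s second(s)\n", $timeEnd - $timeStart),'<br>';

// Second approach: send all requests first, then read the responses as they arrive.
$timeStart = microtimeFloat();
$timeout = 30;
$status = array();
$retdata = array();
$sockets = array();
// HTTP_USER_AGENT is not set when the script runs from the command line,
// so fall back to a placeholder value in that case.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'PHP-fetch-test';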
foreach($urls as $id => $url) {
    $tmp = parse_url($url);
    $host = $tmp['host'];
    $path = isset($tmp['path'])?$tmp['path']:'/';
    empty($tmp['query']) or $path .= '?' . $tmp['query'];
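    $isHttps = isset($tmp['scheme']) && $tmp['scheme'] == 'https';
    $port = empty($tmp['port']) ? ($isHttps ? 443 : 80) : $tmp['port'];
    // https needs the ssl:// transport (openssl extension); a plain TCP
    // connection to port 443 would not perform the TLS handshake.
    $transport = $isHttps ? 'ssl' : 'tcp';
    $fp = stream_socket_client("$transport://$host:$port", $errno, $errstr, $timeout);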
    if (!$fp) {
        $status[$id] = "failed, $errno $errstr";
    } else {
        $status[$id] = "in progress";
        $retdata[$id] = '';
        $sockets[$id] = $fp;
        fwrite($fp, "GET $path HTTP/1.1\r\nHost: $host\r\nUser-Agent: $userAgent\r\nConnection: Close\r\n\r\n");
    }
}
// Now, wait for the results to come back in

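while (count($sockets)) {
    // stream_select() takes its arrays by reference, so real variables are
    // required; passing "$write = null" inline is an error in current PHP.
    $read = $sockets;
    $write = null;
    $except = null;
    $ready = stream_select($read, $write, $except, $timeout);
    if ($ready === false) {
        break; // select itself failed
    }
    if ($ready === 0) {
        // Nothing became readable within $timeout seconds: give up on the rest
        // instead of looping forever.
        foreach ($sockets as $id => $s) {
            $status[$id] = "timed out";
            fclose($s);
        }
        break;
    }
    // Readable sockets either have data for us or were closed by the server.
    foreach ($read as $r) {
        $id = array_search($r, $sockets);
        $data = fread($r, 8192);
        if ($data === false || strlen($data) == 0) {
            // Empty read: the server closed the connection ("Connection: Close").
            $status[$id] = $retdata[$id] === '' ? "failed to connect" : "done";
            fclose($r);
            unset($sockets[$id]);
        } else {
            $retdata[$id] .= $data;
        }
    }
}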
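foreach($retdata as $data){
    // Strip the response headers: the body starts after the first blank line.
    $pos = strpos($data, "\r\n\r\n");
    $body = $pos === false ? '' : substr($data, $pos + 4);
    // Note: trim() drops leading/trailing whitespace, so this length can differ
    // slightly from strlen(file_get_contents($url)) in the first loop.
    echo strlen(trim($body)),'<br>';
}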
$timeEnd = microtimeFloat();
echo sprintf("Time spent: %s second(s)\n", $timeEnd - $timeStart);
?>
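
On the first question (the differing lengths), one likely cause is worth checking. It is offered here as an assumption, not something confirmed in the thread. The socket version sends an HTTP/1.1 request, and an HTTP/1.1 server may answer with "Transfer-Encoding: chunked". The raw body then contains hexadecimal chunk-size lines that get counted by strlen(), whereas the http wrapper behind file_get_contents() removes them. The sketch below decodes such a body by hand. decodeChunkedBody() and extractBody() are hypothetical helper names, and PHP 7 or later is assumed for the type declarations.

```php
<?php
/**
 * Decode an HTTP/1.1 chunked body: each chunk is "<hex size>\r\n<data>\r\n",
 * and a chunk of size 0 ends the body. Hypothetical helper, minimal sketch.
 */
function decodeChunkedBody(string $body): string
{
    $decoded = '';
    $offset = 0;
    $length = strlen($body);
    while ($offset < $length) {
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            break; // malformed or truncated input: stop with what we have
        }
        // The size line may carry ";extension" data after the hex number.
        $sizeLine = substr($body, $offset, $lineEnd - $offset);
        $size = (int) hexdec(trim(explode(';', $sizeLine, 2)[0]));
        if ($size === 0) {
            break; // last chunk
        }
        $decoded .= substr($body, $lineEnd + 2, $size);
        // Skip the size line, the chunk data, and the CRLF that follows the data.
        $offset = $lineEnd + 2 + $size + 2;
    }
    return $decoded;
}

/**
 * Split a raw HTTP response into headers and body, decoding the body
 * only when the headers announce chunked transfer encoding.
 */
function extractBody(string $response): string
{
    $pos = strpos($response, "\r\n\r\n");
    if ($pos === false) {
        return '';
    }
    $headers = substr($response, 0, $pos);
    $body = substr($response, $pos + 4);
    if (preg_match('/^Transfer-Encoding:\s*chunked\s*$/mi', $headers)) {
        return decodeChunkedBody($body);
    }
    return $body;
}
```

With helpers like these, the final loop could print strlen(extractBody($data)) for each entry of $retdata. Sending the request as HTTP/1.0 is another way to avoid chunked responses altogether.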



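------Solution--------------------
You could try the curl_multi_* functions to run the requests concurrently.
That keeps the number of PHP-level instructions to a minimum. As for the problem the two posters above mentioned, it is definitely not something PHP can solve.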
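
The curl_multi_* suggestion could look roughly like the sketch below. It assumes the curl extension is installed and PHP 7 or later; fetchAll() is a hypothetical helper name, and the 30-second timeout simply mirrors the value used in the question.

```php
<?php
/**
 * Fetch several URLs concurrently with curl_multi_*.
 * Returns an array keyed like $urls: the body as a string,
 * or false when that transfer failed. Hypothetical helper, minimal sketch.
 */
function fetchAll(array $urls, int $timeout = 30): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $id => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$id] = $ch;
    }

    // Drive all transfers; curl_multi_select() waits until a socket has activity.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select could not wait; avoid a busy loop
        }
    } while ($running && $status === CURLM_OK);

    // Read the per-transfer result codes to find out which handles failed.
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        $id = array_search($info['handle'], $handles, true);
        if ($id !== false && $info['result'] !== CURLE_OK) {
            $failed[$id] = true;
        }
    }

    $results = [];
    foreach ($handles as $id => $ch) {
        $results[$id] = isset($failed[$id]) ? false : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

Called as $pages = fetchAll($urls); with the $urls array from the question, each entry of $pages can then be measured with strlen(). Since curl decodes chunked responses itself, the lengths should be comparable with the file_get_contents() results.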

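------Solution--------------------
Of course: file_get_contents() is blocking, so running several fetches one after another is bound to be slow.
By contrast, streams opened with socket_*(), fsockopen(), or stream_*() can all be switched to non-blocking mode.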
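
To make the blocking versus non-blocking distinction concrete, here is a small sketch that needs no network. It assumes a Unix-like system, because it uses stream_socket_pair() with STREAM_PF_UNIX to create two connected local sockets; readWithoutBlocking() is a hypothetical helper name.

```php
<?php
/**
 * Read whatever is available right now, without waiting for more data.
 * Hypothetical helper, minimal sketch.
 */
function readWithoutBlocking($stream, int $bytes = 8192): string
{
    stream_set_blocking($stream, false); // fread() now returns immediately
    $chunk = fread($stream, $bytes);
    return $chunk === false ? '' : $chunk;
}

// Two connected local sockets: what is written to $b can be read from $a.
list($a, $b) = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);

// Nothing has been written yet. A blocking fread() would wait here;
// the non-blocking read comes back at once with an empty string.
$start = microtime(true);
$first = readWithoutBlocking($a);
$elapsed = microtime(true) - $start;

// After data is written, stream_select() reports $a as readable.
fwrite($b, "hello");
$read = array($a);
$write = null;
$except = null;
$ready = stream_select($read, $write, $except, 1);
$second = readWithoutBlocking($a);

fclose($a);
fclose($b);
```

The same pattern scales to many sockets: stream_select() sleeps until at least one of them has data, and each read then returns immediately instead of holding up the others.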
------Solution--------------------
How slow is it, exactly?

Try adding this:

$context = stream_context_create(array('http' => array('header'=>'Connection: close')));
file_get_contents(".....",false,$context);
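
A slightly fuller version of the same idea, as a sketch: without "Connection: close", a server that keeps the connection alive can leave file_get_contents() waiting until the connection is closed or times out, so sending the header and setting an explicit timeout are both worth trying. buildCloseContext() and fetchPage() are hypothetical helper names, and the 30-second default is an assumption taken from the question's code.

```php
<?php
/**
 * Build an http stream context that asks the server to close the
 * connection after the response. Hypothetical helper, minimal sketch.
 */
function buildCloseContext(float $timeout = 30.0)
{
    return stream_context_create(array(
        'http' => array(
            'method'  => 'GET',
            'header'  => "Connection: close\r\n",
            'timeout' => $timeout, // read timeout in seconds
        ),
    ));
}

/**
 * Fetch one page with that context.
 * Returns the body as a string, or false on failure.
 */
function fetchPage(string $url, float $timeout = 30.0)
{
    return file_get_contents($url, false, buildCloseContext($timeout));
}
```

In the question's first loop, strlen(fetchPage($url)) could replace strlen(file_get_contents($url)) to see whether the timing changes.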