Implementing a Content Scraper (PHP)

WBOY
2016-06-01 12:32:23

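Scraping

A scraper is usually run on your own machine. Putting it on hosted web space is unwise: it uses a lot of resources, and the host must also allow remote-fetch functions such as file_get_contents($urls) and file($url).
1. Automatically paging through the article list pages and obtaining the article paths.
2. Extracting the title and the content.
3. Storing the result in the database.
4. Problems.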
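
Regarding the remote-fetch requirement mentioned above: file() and file_get_contents() can only open URLs when the allow_url_fopen setting is enabled. A minimal sketch of checking this before a run might look like the following; the helper name can_fetch_remote is made up for this illustration.

```php
<?php
// Hypothetical helper (not from the original article): reports whether
// file() and file_get_contents() are allowed to open URLs on this host.
function can_fetch_remote(): bool
{
    // allow_url_fopen is the php.ini switch that enables URL wrappers.
    // ini_get() returns a string such as "1", "On" or "", so normalize it.
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

if (!can_fetch_remote()) {
    echo "allow_url_fopen is off; file() and file_get_contents() cannot fetch URLs here\n";
}
```
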
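1. Automatically paging through the article list pages and obtaining the article paths.

a. Paging through list pages automatically usually relies on the list being a dynamic page. For example: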

http://www.im286.com/foru ... d=1&page=$i
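The paging can then be driven by incrementing $i or by giving it a range, for example $i++;
Alternatively, as in penzi's demo, you specify which page to start from and which page to stop at; in the code you only need to control the range of $i.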
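
The paging idea above can be sketched as a small loop. This is only an illustration under assumptions: the helper name list_page_urls, the {page} placeholder and the example.com address are invented here and are not the forum URL from the text.

```php
<?php
// Hypothetical helper: builds one list-page URL per page number.
// $template must contain the literal marker {page} where the number goes.
function list_page_urls(string $template, int $first, int $last): array
{
    $urls = [];
    for ($i = $first; $i <= $last; $i++) {
        // str_replace is used instead of sprintf so that "%" characters
        // in an already URL-encoded template are left alone.
        $urls[] = str_replace('{page}', (string) $i, $template);
    }
    return $urls;
}

// Placeholder address, used for illustration only.
$pages = list_page_urls('http://example.com/list.php?fid=1&page={page}', 1, 3);
foreach ($pages as $pageUrl) {
    echo $pageUrl, "\n";
}
```

Passing the first and last page as arguments corresponds to the "from page N to page M" variant; a plain $i++ loop without an upper bound would need some other stop condition, such as a list page that yields no new links.
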

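b. There are two ways to obtain the article paths: with a regular expression that you fill in, and without one.
1) Without a regular expression, you simply collect every link on the article list page above.
  It is still best to filter and clean the links: detect duplicate links and keep only one copy, and turn relative paths such as ../ and ./ into absolute paths.
Below is the messy function I wrote to do this:
PHP: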
--------------------------------------------------------------------------------


// 2004-11-22 clinch
// $e = clinchgeturl("http://im286.com/forumdisplay.php?fid=1");
// var_dump($e);
function clinchgeturl($url)
{
    // $url = "http://127.0.0.1/1.htm";
    // $rootpath = "http://fsrootpathfsfsf/yyyyyy/";
    // var_dump($rrr);
    if (eregi('(.)*[\.](.)*', $url)) {
        // Split the URL on "/" and rebuild its directory part (everything except the file name).
        $roopath = split("\/", $url);
        $rootpath = "http://" . $roopath[2] . "/";
        $nnn = count($roopath) - 1;
        for ($yu = 3; $yu < $nnn; $yu++) {
            $rootpath .= $roopath[$yu] . "/";
        }
        // var_dump($rootpath); // http: ,'',127.0.0.1,xnml,index.php
    } else {
        $rootpath = $url;
        // var_dump($rootpath);
    }
    if (isset($url)) {
        echo "$url contains the following links:\n";
        $fcontents = file($url);
        while (list(, $line) = each($fcontents)) {
            while (eregi('(href[[:space:]]*=[[:space:]]*"?[[:alnum:]:@/._-]+[\?]?[^\"]*"?)', $line, $regs)) {
                // $regs[1] = eregi_replace('(href[[:space:]]*=[[:space:]]*\"?)([[:alnum:]:@/._-]+)(\"?)',"\\2",$regs[1]);
                // Strip the href= prefix and the quotes, keeping only the link target.
                $regs[1] = eregi_replace('(href[[:space:]]*=[[:space:]]*[\"]?)([[:alnum:]:@/._-]+[\?]?[^\"]*)(\.*)[^\"\/]*([\"]?)', "\\2", $regs[1]);

                if (!eregi('^http:\/\/', $regs[1])) {
                    // Relative link that starts with ".."
                    if (eregi('^\.\.', $regs[1])) {
                        // $roopath=eregi_replace('(http:\/\/)?([[:alnum:]:@/._-]+)[[:alnum:]+](\.*)[[:alnum:]+]',"http:\/\/\\2",$url);
                        // Go up one directory level from the current root path.
                        $roopath = split("\/", $rootpath);
                        $rootpath = "http://" . $roopath[2] . "/";
                        // echo "this is the root d :"."\n";
                        $nnn = count($roopath) - 1;
                        for ($yu = 3; $yu < $nnn; $yu++) {
                            $rootpath .= $roopath[$yu] . "/";
                        }
                        // var_dump($rootpath);
                        if (eregi('^\.\.[\/[:alnum:]]', $regs[1])) {
                            // echo "this is ../directory/ :"."\n";
                            // $regs[1]="../xx/xxxxxx.xx";
                            // $rr=split("\/",$regs[1]);
                            // for($oooi=1;$oooi
                            $rrr = $regs[1];
                            // {$rrr.="/".$rr[$oooi];
                            // Drop the leading "../" from the link.
                            $rrr = eregi_replace("^[\.][\.][\/]", '', $rrr);
                            // (the listing is cut off here in the source)
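
The listing above relies on eregi(), eregi_replace() and split(), which were removed in PHP 7.0, and on each(), which was removed in PHP 8.0. As a rough sketch of the same three steps (collect the links, drop duplicates, make relative paths absolute) on a current PHP version, something like the following could be used. It is not the author's function: resolve_url and extract_links are hypothetical names, the base URL is assumed to be a valid absolute http or https URL, and only quoted href values, root-relative paths and ./ and ../ segments are handled.

```php
<?php
// Sketch only: both helpers are hypothetical, not the original clinchgeturl().

// Resolve $href against $base (assumed to be a valid absolute http/https URL).
function resolve_url(string $base, string $href): string
{
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if ($href !== '' && $href[0] === '/') {
        $path = $href; // root-relative link
    } else {
        // Directory of the base page, e.g. /a/b/ for /a/b/list.php
        $basePath = $parts['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    // Set the query string aside so that only the path is normalized.
    $query = '';
    $q = strpos($path, '?');
    if ($q !== false) {
        $query = substr($path, $q);
        $path = substr($path, 0, $q);
    }

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    $clean = '/' . implode('/', $out);
    if ($out && substr($path, -1) === '/') {
        $clean .= '/'; // keep a trailing slash
    }
    return $root . $clean . $query;
}

// Return every distinct link in $html as an absolute URL.
function extract_links(string $html, string $base): array
{
    // Matches href="..." and href='...'; unquoted attributes are ignored.
    preg_match_all('/href\s*=\s*(["\'])(.*?)\1/i', $html, $m);
    $seen = [];
    foreach ($m[2] as $href) {
        $href = html_entity_decode(trim($href));
        if ($href === '' || $href[0] === '#'
            || preg_match('#^(javascript|mailto):#i', $href)) {
            continue; // not a page link
        }
        $seen[resolve_url($base, $href)] = true; // array keys deduplicate
    }
    return array_keys($seen);
}

// Usage with an inline snippet; a real run would pass file_get_contents($url).
$html = '<a href="../c/view.php?id=2">x</a> <a href="./x.htm">y</a>';
foreach (extract_links($html, 'http://example.com/a/b/list.php?fid=1') as $link) {
    echo $link, "\n";
}
```

Using the resolved URL as an array key is one simple way to keep only a single copy of each link; resolving before deduplicating means that two different relative spellings of the same page collapse into one entry.
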