
Using PHP regular expressions to extract the hyperlink addresses from a given URL's page

WBOY
2016-07-20 11:16:58

In data collection and page analysis, you often need to fetch the content of a given URL, or to follow its links and fetch the pages two or three levels deep.

The following is a test implementation, provided for reference only.

The code is as follows:


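/*
Match the links in the given page.
$host (scheme and host, e.g. "http://www.example.com") is prepended to relative links.
return: array of link URLs
*/
function match_links($host, $document) {
    // Assumed pattern: the published listing lost its markup, so this is a
    // reconstruction. Group 2 captures a quoted href value, group 3 an
    // unquoted one, and group 4 the anchor text.
    $pattern = "'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1)(.*?)\\1|([^\s\>]+))[^>]*>?(.*?)</a>'isx";
    $match = array('link' => array(), 'content' => array(), 'all' => array());
    if (!preg_match_all($pattern, $document, $links))
        return $match['link'];

    // Quoted (group 2) and unquoted (group 3) href values
    foreach (array_merge($links[2], $links[3]) as $val) {
        if (empty($val))
            continue;
        if (preg_match('/^https?:\/\//i', $val)) {
            $match['link'][] = $val; // already absolute
        } else {
            // Relative link: join it to the host with exactly one slash
            $match['link'][] = rtrim($host, '/') . '/' . ltrim($val, '/');
        }
    }
    // Anchor text (group 4) and full tags (group 0); not returned at present
    $match['content'] = array_values(array_filter($links[4]));
    $match['all'] = array_values(array_filter($links[0]));

    return $match['link'];
}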
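
Regular expressions are brittle on real-world markup, so it can help to cross-check the extracted links with an HTML parser. The sketch below uses PHP's DOM extension instead of a regex; it is an alternative technique, not the method used in this article. The name extract_links_dom and the sample HTML are assumptions for illustration, and relative links are simply joined to the host root.

```php
<?php
// Hypothetical helper: collect href values with DOMDocument instead of a regex.
// Assumes the DOM extension is enabled.
function extract_links_dom($host, $html) {
    $links = array();
    if (!is_string($html) || $html === '') {
        return $links;
    }
    $doc = new DOMDocument();
    // Suppress the warnings that malformed real-world HTML would otherwise raise
    if (!@$doc->loadHTML($html)) {
        return $links;
    }
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '') {
            continue;
        }
        // Keep absolute URLs; treat everything else as relative to the host root
        if (!preg_match('#^https?://#i', $href)) {
            $href = rtrim($host, '/') . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }
    // Drop duplicates and reindex from 0
    return array_values(array_unique($links));
}

$html = '<p><a href="/a.html">A</a> <a href="http://example.org/b">B</a> <a href="/a.html">again</a></p>';
print_r(extract_links_dom('http://example.com', $html));
```

The parser copes with attribute order, quoting styles, and line breaks inside tags, which a single pattern has to anticipate one by one.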

/*
Get the page text content from the given url
*/
function get_content_from_url($url) {
    $str = @file_get_contents($url);
    if ($str === false)
        return null;
    // Convert GBK pages to UTF-8. Valid UTF-8 is left alone, because
    // UTF-8 text can also pass the GBK check.
    if (!mb_check_encoding($str, "UTF-8") && mb_check_encoding($str, "GBK"))
        $str = iconv("GBK", "UTF-8", $str);
    $str = strip_tags($str); // Filter html tags
    /*
    Alternative to strip_tags():
    $str = preg_replace("@<(.*?)>@is", "", $str);
    */
    // Keep only Chinese characters (U+4E00 to U+9FFF)
    if (!preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $str, $matches))
        return null;

    return join(',', $matches[0]);
}
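
To make the filtering step concrete, the snippet below runs the same kind of range match on an inline string rather than on a fetched page. The name extract_chinese and the sample text are assumptions for illustration. Note that the u modifier requires valid UTF-8 input.

```php
<?php
// Hypothetical helper: keep only runs of characters in U+4E00..U+9FFF,
// joined with commas as in get_content_from_url().
function extract_chinese($text) {
    // With the /u modifier, \x{4e00} denotes a Unicode code point
    if (!preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $text, $matches)) {
        return null; // no match, or invalid UTF-8 input
    }
    return implode(',', $matches[0]);
}

$sample = '<p>PHP 教程 2016: 正则匹配 links</p>';
echo extract_chinese(strip_tags($sample)), "\n"; // prints 教程,正则匹配
```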

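/*
Get the text of the given url and of the pages it links to, down to $depth levels
*/
function get_content($url, $depth) {
    if (!$url || $depth < 1)
        return false;

    $content = '';
    if ($depth > 1) {
        $str = @file_get_contents($url);
        if (!$str)
            return false;

        $parseurl = parse_url($url);
        if (empty($parseurl['host']))
            return false;
        $scheme = isset($parseurl['scheme']) ? $parseurl['scheme'] : 'http';
        $host = $scheme . "://" . $parseurl['host'];

        $arrlink = match_links($host, $str);
        $arr_url = array_unique($arrlink);

        // Recursive call: each linked page gets one level less depth
        foreach ($arr_url as $link) {
            $content .= get_content($link, $depth - 1);
        }
    }

    // Append the text of the starting page itself
    $content .= get_content_from_url($url);

    return $content;
}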
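
A recursive crawl like get_content() can fetch the same page many times when pages link to each other. One common safeguard is a visited set. The sketch below shows that idea with an injected fetch callback so that it runs against in-memory pages instead of the network. The names crawl, $fetch, and $pages, and the simplified href pattern, are assumptions for illustration and not part of the original code.

```php
<?php
// Hypothetical sketch: depth-limited crawl that skips URLs it has already seen.
// $fetch is any callable returning the page body for a URL, or false on failure.
function crawl($url, $depth, $fetch, array &$visited) {
    if ($depth < 1 || isset($visited[$url])) {
        return '';
    }
    $visited[$url] = true;

    $body = call_user_func($fetch, $url);
    if ($body === false) {
        return '';
    }

    $content = '';
    if ($depth > 1) {
        // Simplified pattern: double-quoted absolute hrefs only
        preg_match_all('/<a\s[^>]*href="(https?:\/\/[^"]+)"/i', $body, $m);
        foreach (array_unique($m[1]) as $link) {
            $content .= crawl($link, $depth - 1, $fetch, $visited);
        }
    }
    // Append this page's own text after its children, as get_content() does
    return $content . strip_tags($body);
}

// In-memory stand-in for the network: two pages that link to each other
$pages = array(
    'http://example.com/a' => 'Page A <a href="http://example.com/b">next</a>',
    'http://example.com/b' => 'Page B <a href="http://example.com/a">back</a>',
);
$fetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : false;
};

$visited = array();
echo crawl('http://example.com/a', 3, $fetch, $visited), "\n";
```

Passing the visited set by reference keeps one record for the whole crawl, so a page reached through several paths is fetched only once.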
