
Using PHP regular expressions to extract the hyperlink addresses from a given URL's page

WBOY | Original: 2016-07-20 11:16:58 | 1195 views

When collecting data or analyzing pages, you often need to fetch the content of the page at a given URL, and sometimes also the content of the pages it links to, two or three levels deep.

The following is a test implementation, offered for reference only.

The code is as follows:

/*
Match the links in the given page
$host: scheme and host (e.g. "http://example.com"), prepended to relative links
return: array of the link addresses found in $document
*/
function match_links($host, $document) {
    $match = array('link' => array(), 'content' => array(), 'all' => array());

    // Groups: 1 = quote character, 2 = quoted href, 3 = unquoted href, 4 = link text
    preg_match_all("'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1)(.*?)\\1|([^\s>]+))[^>]*>?(.*?)</a>'isx", $document, $links);

    // Quoted (group 2) and unquoted (group 3) href values
    foreach (array_merge($links[2], $links[3]) as $val) {
        if (empty($val)) {
            continue;
        }
        if (preg_match("/^https?:/i", $val)) {
            $match['link'][] = $val; // already absolute
        } else {
            // Relative link: resolved against the site root
            $match['link'][] = $host . '/' . ltrim($val, '/');
        }
    }
    foreach ($links[4] as $val) {
        if (!empty($val)) {
            $match['content'][] = $val;
        }
    }
    foreach ($links[0] as $val) {
        if (!empty($val)) {
            $match['all'][] = $val;
        }
    }
    return $match['link'];
}
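
Regular expressions are fragile against real-world HTML (attributes in unusual order, line breaks inside tags, and so on). As a point of comparison, here is a minimal sketch that collects the same links with PHP's DOM extension instead of a regex. The function name extract_links_dom and the example.com addresses are made up for illustration, and the sketch assumes the dom extension is enabled and that relative links can be resolved against the site root; it does not handle ../ segments or protocol-relative URLs.

```php
<?php
// Hypothetical helper (not from the original article): collects href values
// with DOMDocument rather than a regular expression.
function extract_links_dom($host, $html) {
    $links = array();
    if ($html === '') {
        return $links; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; @ silences the parser warnings.
    if (!@$doc->loadHTML($html)) {
        return $links;
    }
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '') {
            continue;
        }
        if (preg_match('/^https?:\/\//i', $href)) {
            $links[] = $href; // already absolute
        } elseif (preg_match('/^(#|[a-z][a-z0-9+.-]*:)/i', $href)) {
            continue; // fragment, mailto:, javascript:, etc.
        } else {
            // Assumption: relative links are resolved against the site root.
            $links[] = rtrim($host, '/') . '/' . ltrim($href, '/');
        }
    }
    return array_values(array_unique($links));
}

$html = '<p><a href="/a.html">A</a> <a href=b.html>B</a> '
      . '<a href="http://example.org/c">C</a> <a href="#top">Top</a></p>';
print_r(extract_links_dom('http://example.com', $html));
```

The result is already de-duplicated, so it could be passed on without a separate array_unique() call.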

/*
Get the page text content from the given url
return: the Chinese text of the page joined by commas, or null if none was found
*/
function get_content_from_url($url) {
    $str = @file_get_contents($url);
    if ($str === false) {
        return null;
    }
    // Convert GBK pages to UTF-8; text that is already valid UTF-8 is left alone
    if (!mb_check_encoding($str, "UTF-8") && mb_check_encoding($str, "GBK")) {
        $str = iconv("GBK", "UTF-8", $str);
    }
    $str = strip_tags($str); // Filter html tags
    /*
    Alternative to strip_tags():
    $str = preg_replace( "@<(.*?)>@is", "", $str );
    */
    //Filter non-Chinese characters (the /u modifier is required for \x{...})
    if (!preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $str, $matches)) {
        return null;
    }
    $str = join(',', $matches[0]);
    if (!$str) {
        return null;
    }

    return $str;
}
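
To see the Chinese-character filter in isolation, the following small sketch applies the same character range to a literal string. The helper name keep_chinese is made up for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper for illustration: keeps only runs of CJK Unified
// Ideographs (U+4E00 to U+9FFF) and joins them with commas.
function keep_chinese($text) {
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // which is required for the \x{...} code point escapes.
    if (!preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $text, $matches)) {
        return null; // no match, or invalid UTF-8 input
    }
    return join(',', $matches[0]);
}

echo keep_chinese('Hello 世界, PHP 教程!'), "\n"; // prints 世界,教程
```

Note that this range covers only the basic CJK block; punctuation, digits, and Latin text are all discarded.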

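/*
Get the text content of the given url and of the pages it links to,
following links up to $depth levels (1 = the given page only)
*/
function get_content($url, $depth) {
    if (!$url || $depth < 1) {
        return false;
    }

    $content = '';
    if ($depth > 1) {
        $str = @file_get_contents($url);
        if (!$str) {
            return false;
        }

        $parseurl = parse_url($url);
        if (empty($parseurl['host'])) {
            return false;
        }
        $host = $parseurl['scheme'] . "://" . $parseurl['host'];

        $arrlink = match_links($host, $str);
        $arr_url = array_unique($arrlink);

        // Use a separate loop variable so that $url is not overwritten
        foreach ($arr_url as $link) {
            $content .= get_content($link, $depth - 1); //Recursive call
        }
    }

    $content .= get_content_from_url($url);

    return $content;
}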
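
A recursive crawl like this can fetch the same page many times when pages link to each other. One possible way to avoid that is to keep a set of visited URLs, sketched below. Everything here is hypothetical: crawl, $fetch, and $links are names made up for illustration, and the fetcher and link extractor are passed in as callables so the traversal can be tried on an in-memory "site" without network access.

```php
<?php
// Hypothetical sketch: depth-limited traversal with a visited set.
// $fetch($url) returns the page body or false; $links($url, $body) returns
// an array of absolute URLs. Both are supplied by the caller.
function crawl($url, $depth, $fetch, $links, &$visited = array()) {
    if ($depth < 1 || isset($visited[$url])) {
        return array();
    }
    $visited[$url] = true;

    $body = call_user_func($fetch, $url);
    if ($body === false || $body === null) {
        return array();
    }

    $pages = array($url => $body);
    if ($depth > 1) {
        foreach (call_user_func($links, $url, $body) as $link) {
            // Array union keeps the first body stored for each URL.
            $pages += crawl($link, $depth - 1, $fetch, $links, $visited);
        }
    }
    return $pages;
}

// In-memory stand-in for a web site: URL => outgoing links.
$site = array(
    'http://example.com/'  => array('http://example.com/a', 'http://example.com/b'),
    'http://example.com/a' => array('http://example.com/', 'http://example.com/b'),
    'http://example.com/b' => array(),
);
$fetch = function ($url) use ($site) {
    return isset($site[$url]) ? "body of $url" : false;
};
$links = function ($url, $body) use ($site) {
    return $site[$url];
};

print_r(array_keys(crawl('http://example.com/', 2, $fetch, $links)));
```

One trade-off of this design: a page first reached with little depth remaining is marked as visited and will not be expanded again if it is reached later with more depth to spare.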
