search
HomeBackend DevelopmentPHP Tutorial php网页分析 内容抓取 爬虫 资料分析

php网页分析 内容抓取 爬虫 文件分析

//获取所有内容url保存到文件
function get_index($save_file, $prefix="index_"){
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open ". $save_file ." failed");
    while($i        $url = $prefix . $i .".htm";
        echo "Get ". $url ."...";
        $url_str = get_content_url(get_url($url));
        echo " OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}

//获取目标多媒体对象
function get_object($url_file, $save_file, $split="|--:**:--|"){
    if (!file_exists($url_file)) die($url_file ." not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file ." not content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    foreach($url_arr as $url){
        if (empty($url)) continue;
        echo "Get ". $url ."...";
        $html_str = get_url($url);
        echo $html_str;
        echo $url;
        exit;
        $obj_str = get_content_object($html_str);
        echo " OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}

//遍历目录获取文件内容
function get_dir($save_file, $dir){
    $dp = opendir($dir);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
    while(($file = readdir($dp)) != false){
        if ($file!="." && $file!=".."){
            echo "Read file ". $file ."...";
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo " OK\n";
            fwrite($fp, $obj_str);
        }
    }
    fclose($fp);
}


//获取指定url内容
function get_url($url){
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url ." invalid");
    $fp = fopen($url, "r") or die("Open url: ". $url ." failed.");
    while($fc = fread($fp, 8192)){
        $content .= $fc;
    }
    fclose($fp);
    if (empty($content)){
        die("Get url: ". $url ." content failed.");
    }
    return $content;
}

//使用socket获取指定网页
function get_content_by_socket($url, $host){
    $fp = fsockopen($host, 80) or die("Open ". $url ." failed");
    $header = "GET /".$url ." HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    $header .= "Accept-Encoding: gzip, deflate\r\n";
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: ". $host ."\r\n";
    $header .= "Connection: Keep-Alive\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
    $header .= "Connection: Close\r\n\r\n";

    fwrite($fp, $header);
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}


//获取指定内容里的url
function get_content_url($host_url, $file_contents){

    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = "/([hH][rR][eE][Ff])\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all ($rex, $file_contents, $r);
    $result = ""; //array();
    foreach($r as $c){
        if (is_array($c)){
            foreach($c as $d){
                if (preg_match($reg, $d)){ $result .= $host_url . $d."\n"; }
            }
        }
    }
    return $result;
}

//获取指定内容中的多媒体文件
function get_content_object($str, $split="|--:**:--|"){   
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(.*?)/i";
    preg_match_all($regx, $str, $result);

    if (count($result) == 3){
        $result[2] = str_replace("多媒体: ", "", $result[2]);
        $result[2] = str_replace("
", "", $result[2]);
        $result = $result[1][0] . $split .$result[2][0] . "\n";
    }
    return $result;
}

?>

//PHP 访问网页
$page = '';
$handler = fopen('http://www.baidu.com','r');
while(!feof($handler)){
   $page.=fread($handler,1048576);
}
fclose($handler);
echo $page;
?>


2:判断这个页面是否是报错页面 
  
/** 
      *   $host   服务器地址 
      *   $get   请求页面 
      * 
      */ 
function   getHttpStatus($host,$get="")   { 
$fp   =   fsockopen($host,   80); 
if   (!$fp)   { 
          $res=   -1; 
}   else   { 
  
          fwrite($fp,   "GET   /".$get."   HTTP/1.0\r\n\r\n"); 
          stream_set_timeout($fp,   2); 
          $res   =   fread($fp,   128); 
          $info   =   stream_get_meta_data($fp); 
          fclose($fp); 
          if   ($info['timed_out'])   { 
                  $res=0; 
          }   else   { 
                  $res=   substr($res,9,3); 
          } 

return   $res; 

  
$good=array("200","302"); 
if(in_array(getHttpStatus("5y.nuc.edu.cn","/"),$good))   { 
echo   "正常"; 
}   else   { 
echo   getHttpStatus("5y.nuc.edu.cn","/"); 

  
if(in_array(getHttpStatus("5y.nuc.edu.cn","/error.php"),$good))   { 
echo   "正常"; 
}   else   { 
echo   getHttpStatus("5y.nuc.edu.cn","/error.php"); 

?> 
  
返回 
第一个返回"正常" 
第二个不存在返回"404" 

function   getHttpStatus($host,$get="")   {
//访问网页 获得服务器状态码
$fp   =   fsockopen($host,   80); 
if   (!$fp)   { 
          $res=   -1; 
}   else   { 
  
          fwrite($fp,   "GET   /".$get."   HTTP/1.0\r\n\r\n"); 
          stream_set_timeout($fp,   2); 
          $res   =   fread($fp,   128); 
          $info   =   stream_get_meta_data($fp); 
          fclose($fp); 
          if   ($info['timed_out'])   { 
                  $res=0; 
          }   else   { 
                  $res=   substr($res,9,3); 
          } 

return   $res; 

  
echo   getHttpStatus("5y.nuc.edu.cn","/"); 
echo   getHttpStatus("community.csdn.net","Expert/topic/4758/4758574.xml?temp=.1661646"); 
  
返回 
1   无法连接服务器 
0   超时 
200   OK   (成功返回) 
302   Found   (找到) 
404   没有找到 
...

  
//遍历所有网页   ($type指定类型) 
function   getAllPage($path="./",$type=array("html","htm"))   { 
global   $p; 
if   ($handle   =   opendir($path))   { 
          while   (false   !==   ($file   =   readdir($handle)))   { 
              if(is_dir($file)   &&   $file!="."   &&   $file!="..")   { 
              getAllPage($path.$file."/",$type); 
              }   else   { 
                  $ex=array_pop(explode(".",$file)); 
                  
                  if(in_array(strtolower($ex),$type))   { 
                    array_push($p,   $path.$file); 
                  } 
                  } 
          } 
                  closedir($handle); 


  
$p=array(); 
getAllPage("./"); 
echo   "

";  <br>print_r($p);  <br>echo   "
"; 
  
?>


//抓取页面内容中所有URL。
$str='2006年05月30日   15:13:00   |   评论   (0)'; 
  
preg_match_all("/href=\"([^\"]*\.*php[^\"]*)\"/si",$str,$m); 
//换用下面这个获取所有类型URL 
//preg_match_all("/href=\"([^\"]*)\"/si",$str,$m); 
  
print_r($m[1]); 
?> 

包含在链接里带get参数的地址 
$str=file_get_contents("http://www.php.net"); 
preg_match_all("/href=\"([^\"]*\.*php\?[^\"]*)\"/si",$str,$m); 
print_r($m[1]); 
?>

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
How does PHP identify a user's session?How does PHP identify a user's session?May 01, 2025 am 12:23 AM

PHPidentifiesauser'ssessionusingsessioncookiesandsessionIDs.1)Whensession_start()iscalled,PHPgeneratesauniquesessionIDstoredinacookienamedPHPSESSIDontheuser'sbrowser.2)ThisIDallowsPHPtoretrievesessiondatafromtheserver.

What are some best practices for securing PHP sessions?What are some best practices for securing PHP sessions?May 01, 2025 am 12:22 AM

The security of PHP sessions can be achieved through the following measures: 1. Use session_regenerate_id() to regenerate the session ID when the user logs in or is an important operation. 2. Encrypt the transmission session ID through the HTTPS protocol. 3. Use session_save_path() to specify the secure directory to store session data and set permissions correctly.

Where are PHP session files stored by default?Where are PHP session files stored by default?May 01, 2025 am 12:15 AM

PHPsessionfilesarestoredinthedirectoryspecifiedbysession.save_path,typically/tmponUnix-likesystemsorC:\Windows\TemponWindows.Tocustomizethis:1)Usesession_save_path()tosetacustomdirectory,ensuringit'swritable;2)Verifythecustomdirectoryexistsandiswrita

How do you retrieve data from a PHP session?How do you retrieve data from a PHP session?May 01, 2025 am 12:11 AM

ToretrievedatafromaPHPsession,startthesessionwithsession_start()andaccessvariablesinthe$_SESSIONarray.Forexample:1)Startthesession:session_start().2)Retrievedata:$username=$_SESSION['username'];echo"Welcome,".$username;.Sessionsareserver-si

How can you use sessions to implement a shopping cart?How can you use sessions to implement a shopping cart?May 01, 2025 am 12:10 AM

The steps to build an efficient shopping cart system using sessions include: 1) Understand the definition and function of the session. The session is a server-side storage mechanism used to maintain user status across requests; 2) Implement basic session management, such as adding products to the shopping cart; 3) Expand to advanced usage, supporting product quantity management and deletion; 4) Optimize performance and security, by persisting session data and using secure session identifiers.

How do you create and use an interface in PHP?How do you create and use an interface in PHP?Apr 30, 2025 pm 03:40 PM

The article explains how to create, implement, and use interfaces in PHP, focusing on their benefits for code organization and maintainability.

What is the difference between crypt() and password_hash()?What is the difference between crypt() and password_hash()?Apr 30, 2025 pm 03:39 PM

The article discusses the differences between crypt() and password_hash() in PHP for password hashing, focusing on their implementation, security, and suitability for modern web applications.

How can you prevent Cross-Site Scripting (XSS) in PHP?How can you prevent Cross-Site Scripting (XSS) in PHP?Apr 30, 2025 pm 03:38 PM

Article discusses preventing Cross-Site Scripting (XSS) in PHP through input validation, output encoding, and using tools like OWASP ESAPI and HTML Purifier.

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

PhpStorm Mac version

PhpStorm Mac version

The latest (2018.2.1) professional PHP integrated development tool

SecLists

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment