
PHP platform development: several ways to crawl a page

WBOY (Original)
2016-07-20 11:14:13

When developing network programs, we often need to fetch non-local files. Typically we use PHP to simulate a browser: request the URL over HTTP, then receive the HTML source or XML data. That data usually cannot be output directly; we extract the content we need and format it so that it displays in a friendlier way.

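I. The main methods of crawling pages with PHP: file(), file_get_contents(), fopen()->fread()->fclose(), curl, fsockopen(), and the Snoopy class. Each is shown in turn in the numbered sections that follow.

II. The main ways for PHP to parse HTML or XML code: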
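
Once a page has been fetched, the markup still has to be parsed. As one illustration, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The choice of DOM, the helper name extractLinks, and the sample markup are assumptions made for this example, not something the article prescribes.

```php
<?php
// Hypothetical helper: collect the href of every <a> element in an HTML string.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Sample markup standing in for a fetched page.
$sample = '<html><body><a href="/a">A</a><p>text</p><a href="/b">B</a></body></html>';
$links = extractLinks($sample);
```

In practice the string passed to extractLinks() would be the result of one of the fetching methods described in this article.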

1. file() function

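<?php
// Define the URL
$url = 'http://t.qq.com';
// file() reads the response into an array of lines
$lines_array = file($url);
if ($lines_array === false) {
    die("Could not read $url");
}
// Join the array of lines into a single string
$lines_string = implode('', $lines_array);
// Output the content (you could also save it on your own server)
echo $lines_string;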

2. file_get_contents() function
To open remote URLs with file_get_contents() or fopen(), allow_url_fopen must be enabled: edit php.ini and set allow_url_fopen = On. When allow_url_fopen is off, neither fopen() nor file_get_contents() can open remote files.
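
Before relying on these functions, it can help to check the setting at runtime. The following is a minimal sketch; the helper name canOpenRemoteFiles is hypothetical and chosen only for this example.

```php
<?php
// Hypothetical helper: report whether URL-aware fopen wrappers are enabled.
function canOpenRemoteFiles(): bool
{
    // ini_get() returns the setting as a string such as "1", "On", "0" or "".
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

$remoteAllowed = canOpenRemoteFiles();
```

If $remoteAllowed is false, the setting has to be changed in php.ini, because allow_url_fopen cannot be switched on with ini_set() from inside a script.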

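<?php
// Define the URL
$url = 'http://t.qq.com';
// file_get_contents() reads the remote data into a string
$lines_string = file_get_contents($url);
// Output the content, HTML-escaped so the source is shown as text
// (you could also save it on your own server)
echo htmlspecialchars($lines_string);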
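
file_get_contents() also accepts a stream context as its third argument, which is one way to send a browser-like User-Agent header and set a timeout when simulating browser access. The sketch below only builds the context; the helper name buildHttpContext and the User-Agent string are assumptions made for illustration.

```php
<?php
// Hypothetical helper: build a stream context that sends a browser-like User-Agent.
function buildHttpContext(string $userAgent, float $timeoutSeconds = 5.0)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds, // read timeout in seconds
        ],
    ]);
}

$context = buildHttpContext('Mozilla/5.0 (compatible; ExampleFetcher/1.0)');
// Read the options back to confirm what will be sent.
$options = stream_context_get_options($context);
```

The context would then be passed as file_get_contents($url, false, $context).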

3. fopen()->fread()->fclose() mode

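<?php
// Define the URL
$url = 'http://t.qq.com';
// Open the URL with fopen() in binary mode
$handle = fopen($url, "rb");
if ($handle === false) {
    die("Could not open $url");
}
// Initialize the buffer
$lines_string = "";
// Read the data in a loop, 1024 bytes at a time, until nothing is left
do {
    $data = fread($handle, 1024);
    if ($data === false || strlen($data) == 0) {
        break;
    }
    $lines_string .= $data;
} while (true);
// Close the fopen handle to release resources
fclose($handle);
// Output the content (you could also save it on your own server)
echo $lines_string;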
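
The read loop above does not depend on the handle pointing at a URL; it works on any stream. A minimal sketch that wraps it in a reusable function and exercises it against an in-memory stream follows. The function name readAll and the php://memory test data are assumptions for illustration.

```php
<?php
// Hypothetical helper: read everything from an open stream handle in fixed-size chunks.
function readAll($handle, int $chunkSize = 1024): string
{
    $buffer = '';
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break; // stop on a read error
        }
        $buffer .= $chunk;
    }
    return $buffer;
}

// Exercise the helper with an in-memory stream instead of a remote URL.
$memory = fopen('php://memory', 'w+b');
fwrite($memory, str_repeat('a', 3000));
rewind($memory);
$contents = readAll($memory);
fclose($memory);
```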

4. curl method
Using curl requires that the hosting environment has the curl extension enabled. On Windows, edit php.ini and remove the semicolon in front of extension=php_curl.dll; you also need to copy ssleay32.dll and libeay32.dll to C:\WINDOWS\system32. On Linux, install the curl extension.
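
Whether the extension is actually loaded can be checked from a script before calling any curl function. This is a minimal sketch; the helper name curlAvailable and the message text are hypothetical.

```php
<?php
// Hypothetical helper: report whether the curl extension is available.
function curlAvailable(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

$message = curlAvailable()
    ? 'curl is enabled'
    : 'curl is not enabled; check php.ini or install the extension';
```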

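<?php
$url = 'http://t.qq.com';
$timeout = 5;
// Create a new cURL resource
$ch = curl_init();
// Set the URL and the related options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
// Fetch the URL; curl_exec() returns false on failure
$lines_string = curl_exec($ch);
if ($lines_string === false) {
    $lines_string = 'cURL error: ' . curl_error($ch);
}
// Close the cURL resource and release system resources
curl_close($ch);
// Output the content (you could also save it on your own server)
echo $lines_string;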

5. fsockopen() function socket mode
Whether socket mode works also depends on the server configuration. phpinfo() shows which socket transports the server has enabled.
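
The same information can be read in code with stream_get_transports(), which returns the registered socket transports. A minimal sketch follows; the helper name hasTransport is hypothetical.

```php
<?php
// List the socket transports registered in this PHP build, e.g. tcp, udp, ssl.
$transports = stream_get_transports();

// Hypothetical helper: check for a transport by name.
function hasTransport(string $name): bool
{
    return in_array($name, stream_get_transports(), true);
}

// fsockopen() on port 80 uses plain TCP.
$tcpEnabled = hasTransport('tcp');
```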

<?php
$fp = fsockopen("t.qq.com", 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)<br />\n";
} else {
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: t.qq.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    while (!feof($fp)) {
        echo fgets($fp, 128);
    }
    fclose($fp);
}
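
Unlike the earlier methods, the socket loop outputs the raw response, with the status line and headers included. A minimal sketch for separating them from the body follows. The helper name splitHttpResponse and the sample response are assumptions for illustration, and the sketch does not decode a body sent with chunked transfer encoding.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status line, headers and body.
function splitHttpResponse(string $raw): array
{
    // A blank line (CRLF CRLF) separates the header section from the body.
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $statusLine = array_shift($headerLines);

    $headers = [];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed header lines
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'status'  => $statusLine,
        'headers' => $headers,
        'body'    => $parts[1] ?? '',
    ];
}

// Sample response standing in for what the socket would return.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nConnection: close\r\n\r\n<html>hi</html>";
$response = splitHttpResponse($raw);
```

To use it with the socket code, the fgets() output would be appended to a string instead of echoed, and that string passed to splitHttpResponse() after fclose().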

6. Snoopy class

<?php
// Include the Snoopy class file
require('Snoopy.class.php');
// Instantiate the Snoopy class
$snoopy = new Snoopy;
$url = "http://t.qq.com";
// Fetch the page
$snoopy->fetch($url);
// Store the fetched content in $lines_string
$lines_string = $snoopy->results;
// Output the content (you could also save it on your own server)
echo $lines_string;

Note: The user agent is set on line 45 of the Snoopy.class.php file; search the file for "var $agent" (the text inside the quotation marks). To make requests look like they come from a real browser, you can use PHP to read your browser's user-agent string:
use echo $_SERVER['HTTP_USER_AGENT']; to print the browser information, then copy the echoed content into the agent setting.

Nangikaze Koenko -- more serious PHP platform development

Original address of this article: http://www.cnblogs.com/rirber/archive/2013/06/15/php-server-get-curl-data.html

