Crawled data usually cannot be output directly. We often need to extract the content first and then format it so that it displays in a more readable way.
Here is a brief outline of this article:
1. The main methods of crawling pages with PHP:
   1. file() function
   2. file_get_contents() function
   3. fopen()->fread()->fclose() mode
   4. curl mode
   5. fsockopen() function socket mode
   6. Use plug-ins (such as: http://sourceforge.net/projects/snoopy/)
2. The main ways for PHP to parse html or xml code:
   1. Regular expressions
   2. PHP DOMDocument object
   3. Plug-ins (such as: PHP Simple HTML DOM Parser)
If you already know the above well, you can skip the rest of this article.
Crawling pages with PHP
1. file() function
The code is as follows:
<?php
$url='http://t.qq.com';
// file() reads the whole page into an array, one element per line.
$lines_array=file($url);
$lines_string=implode('',$lines_array);
echo htmlspecialchars($lines_string);
?>
2. file_get_contents() function
Using file_get_contents and fopen on remote files requires allow_url_fopen to be enabled. Method: edit php.ini and set allow_url_fopen = On. When allow_url_fopen is off, neither fopen nor file_get_contents can open remote files.
The code is as follows:
<?php
$url='http://t.qq.com';
$lines_string=file_get_contents($url);
echo htmlspecialchars($lines_string);
?>
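The allow_url_fopen requirement can also be checked at run time before fetching. The following is a minimal sketch, not a definitive implementation: the helper name fetch_page and its 5-second default timeout are assumptions made for illustration, and failure is signalled by returning null.

```php
<?php
// Hypothetical helper: fetch a page with file_get_contents(), or return null on failure.
function fetch_page(string $url, int $timeoutSeconds = 5): ?string
{
    // Remote URLs can only be opened when allow_url_fopen is enabled.
    if (!filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null;
    }
    // The "http" context option "timeout" limits how long a read may take.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);
    // @ suppresses the warning; failure is reported through the return value instead.
    $content = @file_get_contents($url, false, $context);
    return $content === false ? null : $content;
}
```

A call such as fetch_page('http://t.qq.com') would then return either the page source or null, which the caller can test before passing the result to htmlspecialchars().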
3. fopen()->fread()->fclose() mode
The code is as follows:
<?php
$url='http://t.qq.com';
$handle=fopen($url,"rb");
if($handle===false){die("Could not open $url");}
$lines_string="";
// Read in 1024-byte chunks until fread() returns no more data.
do{
    $data=fread($handle,1024);
    if(strlen($data)==0){break;}
    $lines_string.=$data;
}while(true);
fclose($handle);
echo htmlspecialchars($lines_string);
?>
4. Curl method
Using curl requires the hosting environment to have the curl extension enabled. Method: on Windows, edit php.ini and remove the semicolon in front of extension=php_curl.dll, then copy ssleay32.dll and libeay32.dll to C:\WINDOWS\system32; on Linux, install the curl extension.
The code is as follows:
<?php
$url='http://t.qq.com';
$ch=curl_init();
$timeout=5;
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response as a string instead of printing it directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$lines_string=curl_exec($ch);
curl_close($ch);
echo htmlspecialchars($lines_string);
?>
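The curl example can be extended to check that the extension is loaded and to report failures. This is only a sketch under assumptions: the helper name curl_fetch, the choice to follow redirects, and the doubled overall timeout are illustrative and do not come from the original article.

```php
<?php
// Hypothetical helper: fetch a URL with curl; returns null on any failure.
function curl_fetch(string $url, int $timeoutSeconds = 5): ?string
{
    // Without the curl extension, curl_init() does not exist at all.
    if (!extension_loaded('curl')) {
        return null;
    }
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,                 // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,                 // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,      // limit on connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds * 2,  // limit on the whole transfer
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes why the transfer failed.
        error_log('curl error: ' . curl_error($ch));
        return null;
    }
    // The handle is released when $ch goes out of scope.
    return $body;
}
```

Checking for a false return from curl_exec() matters because passing false to htmlspecialchars() would silently print nothing.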
5. fsockopen() function socket mode
Whether socket mode works correctly also depends on the server settings. You can check which transports the server has enabled through phpinfo(). For example, my local PHP does not have http enabled for sockets, so I can only test with udp.
The code is as follows:
<?php
$fp = fsockopen("udp://127.0.0.1", 13, $errno, $errstr);
if (!$fp) {
    echo "ERROR: $errno - $errstr<br />\n";
} else {
    fwrite($fp, "\n");
    echo fread($fp, 26);
    fclose($fp);
}
?>
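On a server where the tcp transport is available, the same socket approach can fetch a web page by writing an HTTP request by hand. The sketch below assumes a plain HTTP/1.0 GET on port 80; the helper names build_get_request and socket_fetch are hypothetical, and the returned string is the raw response, headers included.

```php
<?php
// Hypothetical helper: build a minimal HTTP/1.0 GET request.
function build_get_request(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.0\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n\r\n";
}

// Hypothetical helper: send the request over TCP and return the raw response, or null on failure.
function socket_fetch(string $host, string $path = '/', int $port = 80, float $timeout = 5.0): ?string
{
    // stream_get_transports() lists the socket transports this PHP build supports.
    if (!in_array('tcp', stream_get_transports(), true)) {
        return null;
    }
    $fp = @fsockopen("tcp://{$host}", $port, $errno, $errstr, $timeout);
    if (!$fp) {
        return null;
    }
    fwrite($fp, build_get_request($host, $path));
    // Read until the server closes the connection.
    $response = stream_get_contents($fp);
    fclose($fp);
    return $response === false ? null : $response;
}
```

Unlike file_get_contents() or curl, this approach leaves the status line and headers in the result, so the caller has to split them from the body.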
6. Plug-ins
There are many plug-ins available online. Snoopy is one that I found; if you are interested, you can look into it.
Parsing xml (html) with PHP
1. Regular expression:
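The code is as follows:
<?php
$url='http://t.qq.com';
$lines_string=file_get_contents($url);
// eregi() was removed in PHP 7; preg_match() with the i flag replaces it.
preg_match('/<title>(.*)<\/title>/is',$lines_string,$title);
// $title[0] is the whole match including the tags; $title[1] is the text only.
echo htmlspecialchars($title[0] ?? '');
?>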
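The regular-expression approach can be wrapped in a small function so it can be tried on a literal string without fetching anything. This is a sketch only: the function name extract_title is hypothetical, and the pattern handles the simple case of a single title element, not arbitrary HTML.

```php
<?php
// Hypothetical helper: return the text of the first <title> element, or null if none is found.
function extract_title(string $html): ?string
{
    // i = case-insensitive, s = let "." match newlines, .*? = non-greedy match.
    if (preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $html, $matches)) {
        return trim($matches[1]);
    }
    return null;
}

// Usage on a literal string; a fetched page would be passed the same way.
$title = extract_title("<html><head><TITLE>\n Example page </TITLE></head></html>");
echo htmlspecialchars($title ?? '');
```

The non-greedy quantifier keeps the match from running to a later closing tag if the page happens to contain more than one.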
2. PHP DOMDocument() object
If the remote html or xml contains syntax errors, PHP will report errors (as warnings) when DOMDocument parses it.
The code is as follows:
<?php
$url='http://www.136web.cn';
$html=new DOMDocument();
$html->loadHTMLFile($url);
// getElementsByTagName() returns a node list; item(0) is the first match.
$title=$html->getElementsByTagName('title');
echo $title->item(0)->nodeValue;
?>
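The parse warnings raised for malformed markup can be collected internally by libxml instead of being printed. The following sketch assumes the HTML has already been fetched into a string; the function name dom_title is hypothetical.

```php
<?php
// Hypothetical helper: read the <title> of an HTML string without printing parse warnings.
function dom_title(string $html): ?string
{
    // loadHTML() rejects an empty string in PHP 8, so handle it up front.
    if ($html === '') {
        return null;
    }
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    // Restore the previous error-handling mode.
    libxml_use_internal_errors($previous);

    $nodes = $doc->getElementsByTagName('title');
    return $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
}
```

Checking the node list length before calling item(0) avoids an error on pages that have no title element.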
3. Plug-ins
This article uses PHP Simple HTML DOM Parser as an example for a brief introduction. The syntax of simple_html_dom is similar to jQuery, so PHP can manipulate the DOM as easily as jQuery does.
The code is as follows:
<?php
$url='http://t.qq.com';
include_once('../simplehtmldom/simple_html_dom.php');
$html=file_get_html($url);
// find() takes a selector and returns an array of matching elements.
$title=$html->find('title');
echo $title[0]->plaintext;
?>
Chinese developers are certainly creative. Foreign developers tend to be ahead in technology, but Chinese developers are often better at applying it and come up with uses that others would not dare to consider. Remote crawling and parsing in PHP was originally meant to make data integration easier, but it has become very popular here, which is why there are so many content-collection sites. They create no valuable content of their own and instead crawl other people's websites and present the content as theirs. Type the keyword "php small" into Baidu, and the first entry on the suggestion list is "php thief program". I then typed the same keyword into Google, and could only laugh and say nothing.