(Repost) A roundup of ways to fetch web page content with PHP
① Fetching web page content with PHP
http://hi.baidu.com/quqiufeng/blog/item/7e86fb3f40b598c67d1e7150.html
header("Content-type: text/html; charset=utf-8");
1、
$xhr = new COM("MSXML2.XMLHTTP");
$xhr->open("GET","http://localhost/xxx.php?id=2",false);
$xhr->send();
echo $xhr->responseText
2、file_get_contents实现
$url="http://www.blogjava.net/pts";
echo file_get_contents( $url );
?>
3. Implemented with fopen() and stream_get_contents():

<?php
if ($stream = fopen('http://www.sohu.com', 'r')) {
    // print the whole page, starting at offset 10
    echo stream_get_contents($stream, -1, 10);
    fclose($stream);
}
if ($stream = fopen('http://www.sohu.net', 'r')) {
    // print the first 5 bytes
    echo stream_get_contents($stream, 5);
    fclose($stream);
}
?>
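A side note on method 2: file_get_contents() only reaches remote URLs when allow_url_fopen is enabled, and it accepts a stream context for request options. A minimal sketch, assuming a placeholder URL and typical option values (none of this is from the original post):

<?php
// Sketch: pass a timeout and a User-Agent to file_get_contents() via a stream context.
// The URL and header values are placeholders.
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'GET',
        'timeout' => 10,   // seconds
        'header'  => "User-Agent: Mozilla/5.0 (compatible)\r\n",
    ),
));
$html = file_get_contents('http://www.example.com/', false, $context);
if ($html === false) {
    echo "fetch failed\n";
} else {
    echo strlen($html) . " bytes fetched\n";
}
?>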
② Fetching web page content with PHP
http://www.blogjava.net/pts/archive/2007/08/26/99188.html
The simple approach:

<?php
$url = "http://www.blogjava.net/pts";
echo file_get_contents($url);
?>

Or:

<?php
if ($stream = fopen('http://www.sohu.com', 'r')) {
    // print the whole page, starting at offset 10
    echo stream_get_contents($stream, -1, 10);
    fclose($stream);
}
if ($stream = fopen('http://www.sohu.net', 'r')) {
    // print the first 5 bytes
    echo stream_get_contents($stream, 5);
    fclose($stream);
}
?>
③ PHP: fetch a site's content and save it to a TXT file (source code)
http://blog.chinaunix.net/u1/44325/showart_348444.html
<?php
$my_book_url = 'http://book.yunxiaoge.com/files/article/html/4/4550/index.html';
// extract the base URL of the book's directory page
preg_match('#http://book\.yunxiaoge\.com/files/article/html/[0-9]+/[0-9]+/#', $my_book_url, $myBook);
$my_book_txt = $myBook[0];

$file_handle = fopen($my_book_url, "r");   // open the index page
if (file_exists("test.txt")) {
    unlink("test.txt");
}
$handle = fopen("test.txt", 'a');          // the output file

while (!feof($file_handle)) {              // loop until the end of the index page
    $line = fgets($file_handle);           // read one line
    // look for links to the chapter pages inside the index
    if (preg_match('/href="[0-9]+\.html/', $line, $reg)) {
        $my_book_txt_url = str_replace('href="', '', $reg[0]);
        $my_book_txt_over_url = $my_book_txt . $my_book_txt_url;   // build the full chapter URL
        echo $my_book_txt_over_url . "<br>\n";

        $file_handle_txt = fopen($my_book_txt_over_url, "r");      // open the chapter page
        while (!feof($file_handle_txt)) {
            $line_txt = fgets($file_handle_txt);
            // body lines on these pages start with &nbsp; entities
            if (preg_match('/^&nbsp;.+/', $line_txt, $reg2)) {
                $my_over_txt = $reg2[0];
                $my_over_txt = str_replace("&nbsp;", " ", $my_over_txt);   // filter the space entities
                $my_over_txt = str_replace("<br>", "", $my_over_txt);      // drop line-break tags
                $my_over_txt = str_replace("&quot;", "", $my_over_txt);    // drop quote entities
                fwrite($handle, $my_over_txt . "\n");                      // write to the file
            }
        }
        fclose($file_handle_txt);
    }
}
fclose($handle);
fclose($file_handle);   // close the index page
echo "Done";
?>
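For comparison, the chapter links in the index page above can also be collected in one pass with preg_match_all(). A minimal sketch reusing the same URL and link pattern (not part of the original post):

<?php
// Sketch: grab every chapter link from the index page at once instead of line by line.
$base  = 'http://book.yunxiaoge.com/files/article/html/4/4550/';
$index = file_get_contents($base . 'index.html');
preg_match_all('/href="([0-9]+\.html)"/', $index, $matches);
foreach ($matches[1] as $chapter) {
    echo $base . $chapter . "\n";   // full URL of each chapter page
}
?>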
Below is a more heavy-duty approach.
It uses a class called Snoopy.
I first ran across it here:
The Snoopy package for fetching web page content in PHP
http://blog.declab.com/read.php/27.htm
And here is Snoopy's project page:
http://sourceforge.net/projects/snoopy/
Some brief notes on it here:
Code collection - the Snoopy class and a simple way to use it
http://blog.passport86.com/?p=161
Download: http://sourceforge.net/projects/snoopy/
I only discovered this gem today, so I grabbed it right away to have a look; internally it builds on parse_url.
Personally I am still more used to cURL.
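For reference, since cURL is mentioned but never shown in the posts quoted here, below is a minimal cURL sketch; the URL is a placeholder and the options are just the usual ones, not taken from any of the original posts.

<?php
// Minimal cURL sketch for fetching a page (requires the curl extension).
$ch = curl_init('http://www.example.com/');        // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 10);              // give up after 10 seconds
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible)');
$html = curl_exec($ch);
if ($html === false) {
    echo 'curl error: ' . curl_error($ch) . "\n";
}
curl_close($ch);
echo $html;
?>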
Snoopy is a PHP class that mimics the behaviour of a web browser: it automates fetching web page content and submitting forms.
Some of its features:
1. Easily fetches the contents of a web page
2. Easily fetches the text of a page (HTML stripped out)
3. Easily fetches the links in a page
4. Proxy host support
5. Basic user/password authentication
6. Custom user agent, referer, cookies and header content
7. Browser redirects, with control over the redirect depth
8. Expands the links in a page into fully qualified URLs (the default)
9. Easily submits form data and retrieves the results
10. Follows HTML frames (added in v0.92)
11. Passes cookies along when redirecting
See the documentation bundled with the download for usage details.
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->fetchform("http://www.phpx.com/happy/logging.php?action=login");
print $snoopy->results;
?>
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;

$submit_url = "http://www.phpx.com/happy/logging.php?action=login";

$submit_vars["loginmode"]   = "normal";
$submit_vars["styleid"]     = "1";
$submit_vars["cookietime"]  = "315360000";
$submit_vars["loginfield"]  = "username";
$submit_vars["username"]    = "********";   // your username
$submit_vars["password"]    = "*******";    // your password
$submit_vars["questionid"]  = "0";
$submit_vars["answer"]      = "";
$submit_vars["loginsubmit"] = "提 交";

$snoopy->submit($submit_url, $submit_vars);
print $snoopy->results;
?>
Below is Snoopy's README.
NAME:
    Snoopy - the PHP net client v1.2.4

SYNOPSIS:
    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->fetchtext("http://www.php.net/");
    print $snoopy->results;

    $snoopy->fetchlinks("http://www.phpbuilder.com/");
    print $snoopy->results;

    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";

    $submit_vars["q"] = "amiga";
    $submit_vars["submit"] = "Search!";
    $submit_vars["searchhost"] = "Altavista";

    $snoopy->submit($submit_url,$submit_vars);
    print $snoopy->results;

    $snoopy->maxframes=5;
    $snoopy->fetch("http://www.ispi.net/");
    echo "<PRE>\n";
    echo htmlentities($snoopy->results[0]);
    echo htmlentities($snoopy->results[1]);
    echo htmlentities($snoopy->results[2]);
    echo "</PRE>\n";

    $snoopy->fetchform("http://www.altavista.com");
    print $snoopy->results;
DESCRIPTION:
    What is Snoopy?

    Snoopy is a PHP class that simulates a web browser. It automates the
    task of retrieving web page content and posting forms, for example.

    Some of Snoopy's features:

    * easily fetch the contents of a web page
    * easily fetch the text from a web page (strip html tags)
    * easily fetch the links from a web page
    * supports proxy hosts
    * supports basic user/pass authentication
    * supports setting user_agent, referer, cookies and header content
    * supports browser redirects, and controlled depth of redirects
    * expands fetched links to fully qualified URLs (default)
    * easily submit form data and retrieve the results
    * supports following html frames (added v0.92)
    * supports passing cookies on redirects (added v0.92)

REQUIREMENTS:
    Snoopy requires PHP with PCRE (Perl Compatible Regular Expressions),
    which should be PHP 3.0.9 and up. For read timeout support, it requires
    PHP 4 Beta 4 or later. Snoopy was developed and tested with PHP 3.0.12.
CLASS METHODS:

    fetch($URI)
    -----------
    This is the method used for fetching the contents of a web page.
    $URI is the fully qualified URL of the page to fetch.
    The results of the fetch are stored in $this->results.
    If you are fetching frames, then $this->results
    contains each frame fetched in an array.

    fetchtext($URI)
    ---------------
    This behaves exactly like fetch() except that it only returns
    the text from the page, stripping out html tags and other
    irrelevant data.

    fetchform($URI)
    ---------------
    This behaves exactly like fetch() except that it only returns
    the form elements from the page, stripping out html tags and other
    irrelevant data.

    fetchlinks($URI)
    ----------------
    This behaves exactly like fetch() except that it only returns
    the links from the page. By default, relative links are
    converted to their fully qualified URL form.

    submit($URI,$formvars)
    ----------------------
    This submits a form to the specified $URI. $formvars is an
    array of the form variables to pass.

    submittext($URI,$formvars)
    --------------------------
    This behaves exactly like submit() except that it only returns
    the text from the page, stripping out html tags and other
    irrelevant data.

    submitlinks($URI)
    -----------------
    This behaves exactly like submit() except that it only returns
    the links from the page. By default, relative links are
    converted to their fully qualified URL form.
CLASS VARIABLES:    (default value in parentheses)

    $host           the host to connect to
    $port           the port to connect to
    $proxy_host     the proxy host to use, if any
    $proxy_port     the proxy port to use, if any
    $agent          the user agent to masquerade as (Snoopy v0.1)
    $referer        referer information to pass, if any
    $cookies        cookies to pass, if any
    $rawheaders     other header info to pass, if any
    $maxredirs      maximum redirects to allow. 0=none allowed. (5)
    $offsiteok      whether or not to allow redirects off-site. (true)
    $expandlinks    whether or not to expand links to fully qualified URLs (true)
    $user           authentication username, if any
    $pass           authentication password, if any
    $accept         http accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
    $error          where errors are sent, if any
    $response_code  response code returned from the server
    $headers        headers returned from the server
    $maxlength      max return data length
    $read_timeout   timeout on read operations (requires PHP 4 Beta 4+);
                    set to 0 to disallow timeouts
    $timed_out      true if a read operation timed out (requires PHP 4 Beta 4+)
    $maxframes      number of frames we will follow
    $status         http status of the fetch
    $temp_dir       temp directory that the webserver can write to (/tmp)
    $curl_path      system path to the cURL binary, set to false if none
EXAMPLES:

    Example:    fetch a web page and display the return headers and
                the contents of the page (html-escaped):

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->user = "joe";
    $snoopy->pass = "bloe";

    if($snoopy->fetch("http://www.slashdot.org/"))
    {
        echo "response code: ".$snoopy->response_code."<br>\n";
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<br>\n";

        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

    Example:    submit a form and print out the result headers
                and html-escaped page:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";

    $submit_vars["q"] = "amiga";
    $submit_vars["submit"] = "Search!";
    $submit_vars["searchhost"] = "Altavista";

    if($snoopy->submit($submit_url,$submit_vars))
    {
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<br>\n";

        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

    Example:    showing functionality of all the variables:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->proxy_host = "my.proxy.host";
    $snoopy->proxy_port = "8080";

    $snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
    $snoopy->referer = "http://www.microsnot.com/";

    $snoopy->cookies["SessionID"] = 238472834723489;
    $snoopy->cookies["favoriteColor"] = "RED";

    $snoopy->rawheaders["Pragma"] = "no-cache";

    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = false;

    $snoopy->user = "joe";
    $snoopy->pass = "bloe";

    if($snoopy->fetchtext("http://www.phpbuilder.com"))
    {
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<br>\n";

        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

    Example:    fetch framed content and display the results:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->maxframes = 5;

    if($snoopy->fetch("http://www.ispi.net/"))
    {
        echo "<PRE>".htmlspecialchars($snoopy->results[0])."</PRE>\n";
        echo "<PRE>".htmlspecialchars($snoopy->results[1])."</PRE>\n";
        echo "<PRE>".htmlspecialchars($snoopy->results[2])."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";
Finally, a longer set of helper functions: crawl the pages of an index, extract the links they contain, fetch pages over a raw socket, and pull out multimedia links.

<?php
// Fetch all content URLs and save them to a file.
function get_index($save_file, $prefix = "index_") {
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open " . $save_file . " failed");
    while ($i < $count) {
        $url = $prefix . $i . ".htm";
        echo "Get " . $url . "...";
        $url_str = get_content_url(get_url($url));
        echo " OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}

// Fetch the target multimedia objects.
function get_object($url_file, $save_file, $split = "|--:**:--|") {
    if (!file_exists($url_file)) die($url_file . " not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file . " not content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    foreach ($url_arr as $url) {
        if (empty($url)) continue;
        echo "Get " . $url . "...";
        $html_str = get_url($url);
        echo $html_str;
        echo $url;
        exit;   // debugging left in the original; remove it to process every URL
        $obj_str = get_content_object($html_str);
        echo " OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}

// Walk a directory and extract the objects from each file's contents.
function get_dir($save_file, $dir) {
    $dp = opendir($dir);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    while (($file = readdir($dp)) !== false) {
        if ($file != "." && $file != "..") {
            echo "Read file " . $file . "...";
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo " OK\n";
            fwrite($fp, $obj_str);
        }
    }
    fclose($fp);
}

// Fetch the contents of a given URL.
function get_url($url) {
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url . " invalid");
    $fp = fopen($url, "r") or die("Open url: " . $url . " failed.");
    $content = "";
    while ($fc = fread($fp, 8192)) {
        $content .= $fc;
    }
    fclose($fp);
    if (empty($content)) {
        die("Get url: " . $url . " content failed.");
    }
    return $content;
}

// Fetch a page using a raw socket.
function get_content_by_socket($url, $host) {
    $fp = fsockopen($host, 80) or die("Open " . $url . " failed");
    $header  = "GET /" . $url . " HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    $header .= "Accept-Encoding: gzip, deflate\r\n";
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: " . $host . "\r\n";
    $header .= "Connection: Keep-Alive\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
    $header .= "Connection: Close\r\n\r\n";
    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}

// Extract the URLs contained in the given content.
function get_content_url($host_url, $file_contents) {
    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = "/([hH][rR][eE][Ff])\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all($rex, $file_contents, $r);
    $result = ""; // array();
    foreach ($r as $c) {
        if (is_array($c)) {
            foreach ($c as $d) {
                if (preg_match($reg, $d)) {
                    $result .= $host_url . $d . "\n";
                }
            }
        }
    }
    return $result;
}

// Extract the multimedia file links from the given content.
function get_content_object($str, $split = "|--:**:--|") {
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(<b>.*?<\/b>)/i";
    preg_match_all($regx, $str, $result);
    if (count($result) == 3) {
        $result[2] = str_replace("<b>多媒体: ", "", $result[2]);
        $result[2] = str_replace("</b>", "", $result[2]);
        $result = $result[1][0] . $split . $result[2][0] . "\n";
    }
    return $result;
}
?>
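A minimal usage sketch for the socket-based fetcher above; the host and path are placeholders, not taken from the original post.

<?php
// Hypothetical call; "www.example.com" and "index.html" are placeholder values.
$response = get_content_by_socket("index.html", "www.example.com");
// The return value is the raw HTTP response, status line and headers included;
// the body starts after the first blank line.
echo $response;
?>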
Grabbing a specific div block and the images from a page with PHP
(2009-06-05 09:56:23)
1. Get all the images in a given page:

<?php
// fetch the page at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/');
// grab every img tag and store the matches in $match
preg_match_all('/<img[^>]*>/Ui', $text, $match);
// print $match
print_r($match);
?>
-----------------
2. Get the first image in a given page:

<?php
// fetch the page at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/');
// grab the first img tag and store it in $match (same regex syntax as above)
preg_match('/<img[^>]*>/Ui', $text, $match);
// print $match
print_r($match);
?>
------------------------------------
3. Get a specific div block from a page (identified by its id):

<?php
// fetch the page at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');
// strip newlines and whitespace first (only needed for serialized content)
//$text = str_replace(array("\r", "\n", "\t"), '', $text);
// grab the div whose id is PostContent and store it in $match
preg_match('/<div id="PostContent"[^>]*>(.*?)<\/div>/si', $text, $match);
// print $match[0]
print($match[0]);
?>
-------------------------------------------
4. Combining 2 and 3 above:

<?php
// fetch the page at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');
// grab the div whose id is PostContent and store it in $match
preg_match('/<div id="PostContent"[^>]*>(.*?)<\/div>/si', $text, $match);
// grab the first img tag inside that block and store it in $match2
preg_match('/<img[^>]*>/Ui', $match[0], $match2);
// print $match2
print_r($match2);
?>
