Scraping Links With PHP-php教程-PHP中文網

首頁

後端開發

php教程

Scraping Links With PHP

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 23, 2016 pm 02:36 PM

Scraping Links With PHP

by justin on August 11, 2007

FROM:http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content

In this tutorial you will learn how to build a PHP script that scrapes links from any web page.

What You’ll Learn How to use cURL to get the content from a website (URL). Call PHP DOM functions to parse the HTML so you can extract links. Use XPath to grab links from specific parts of a page. Store the scraped links in a MySQL database. Put it all together into a link scraper. What else you could use a scraper for. Legal issues associated with scraping content. What You Will Need Basic knowledge of PHP and MySQL. A web server running PHP 5. The cURL extension for PHP. MySQL ? if you want to store the links. Get The Page Content

cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:

$ch = curl_init();curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);curl_setopt($ch, CURLOPT_URL,$target_url);curl_setopt($ch, CURLOPT_FAILONERROR, true);curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);curl_setopt($ch, CURLOPT_AUTOREFERER, true);curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);curl_setopt($ch, CURLOPT_TIMEOUT, 10);$html = curl_exec($ch);if (!$html) {echo "<br />cURL error number:" .curl_errno($ch);echo "<br />cURL error:" . curl_error($ch);exit;}

If the request is successful $html will be filled with the content of $target_url. If the call fails then we’ll see an error message about the failure.

curl_setopt($ch, CURLOPT_URL,$target_url);

This line determines what URL will be requested. For example if you wanted to scrape this site you’d have $target_url = “/makebeta/”. I won’t go into the rest of the options that are set (except for CURLOPT_USERAGENT ? see below). You can read an in depth tutorial on PHP and cURL here.

Tip: Fake Your User Agent

Many websites won’t play nice with you if you come knocking with the wrong User Agent string. What’s a User Agent string? It’s part of every request to a web server that tells it what type of agent (browser, spider, etc) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

This would set cURL’s user agent to mimic Google’s. You can find a comprehensive list of user agents here: User Agents.

Common User Agents

I’ve done a bit of the leg work for you and gathered the most common user agents:

Search Engine User Agents Google ? Googlebot/2.1 ( http://www.googlebot.com/bot.html) Google Image ? Googlebot-Image/1.0 ( http://www.googlebot.com/bot.html) MSN Live ? msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm) Yahoo ? Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) ask Browser User Agents Firefox (WindowsXP) ? Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 IE 7 ? Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30) IE 6 ? Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322) Safari ? Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2 Opera ? Opera/9.00 (Windows NT 5.1; U; en) Using PHP’s DOM Functions To Parse The HTML

PHP provides with a really cool tool for working with HTML content: DOM Functions. The DOM Functions allow you to parse HTML (or XML) into an object structure (or DOM ? Document Object Model). Let’s see how we do it:

$dom = new DOMDocument();@$dom->loadHTML($html);

Wow is it really that easy? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice clean way. I discovered this over at Russll Beattie’s post on: Using PHP TO Scrape Sites As Feeds, thanks Russell!

Tip: You may have noticed I put @ in front of loadHTML(), this suppresses some annoying warnings that the HTML parser throws on many pages that have non-standard compliant code.

XPath Makes Getting The Links You Want Easy

Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (otherwise known as tags in HTML). Say you want to only get links that are within unordered lists. All you have to do is write a query like “/html/body//ul//li//a” and pass it to XPath->evaluate(). I’m not going to go into all the ways you can use XPath because I’m just learning myself and someone else has already made a great list of examples: XPath Examples. Here’s a code snippet that will just get every link on the page using XPath:

$xpath = new DOMXPath($dom);$hrefs = $xpath->evaluate("/html/body//a");

Iterate And Store Your Links

Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:

for ($i = 0; $i < $hrefs->length; $i++) {$href = $hrefs->item($i);$url = $href->getAttribute('href');storeLink($url,$target_url);}

FULL PROGRAM:


$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);curl_setopt($ch, CURLOPT_URL,$target_url);curl_setopt($ch, CURLOPT_FAILONERROR, true);curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);curl_setopt($ch, CURLOPT_AUTOREFERER, true);curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);curl_setopt($ch, CURLOPT_TIMEOUT, 10);$html = curl_exec($ch);if (!$html) { echo "
cURL error number:" .curl_errno($ch); echo "
cURL error:" . curl_error($ch); exit;}$dom = new DOMDocument();@$dom->loadHTML($html);$xpath = new DOMXPath($dom);$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {  $href = $hrefs->item($i); $url = $href->getAttribute('href');  echo $url; echo "
"; }
?>

then you can store url to your database. more details from here:http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content

REF:tutorial on PHP and cURL

You can find a comprehensive list of user agents here: User Agents.

Using PHP TO Scrape Sites As Feeds

陳述

本文內容由網友自願投稿，版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容，請聯絡admin@php.cn

優化PHP代碼：減少內存使用和執行時間May 10, 2025 am 12:04 AM

TooptimizePHPcodeforreducedmemoryusageandexecutiontime,followthesesteps:1)Usereferencesinsteadofcopyinglargedatastructurestoreducememoryconsumption.2)LeveragePHP'sbuilt-infunctionslikearray_mapforfasterexecution.3)Implementcachingmechanisms,suchasAPC

PHP電子郵件：分步發送指南May 09, 2025 am 12:14 AM

phpisusedforsendendemailsduetoitsignegrationwithservermailservicesand andexternalsmtpproviders，自動化intifications andMarketingCampaigns.1）設置設置yourphpenvenvironnvironnvironmentwithaweberswithawebserverserververandphp，確保themailfunctionisenabled.2）useabasicscruct

如何通過PHP發送電子郵件：示例和代碼May 09, 2025 am 12:13 AM

發送電子郵件的最佳方法是使用PHPMailer庫。 1)使用mail()函數簡單但不可靠，可能導致郵件進入垃圾郵件或無法送達。 2)PHPMailer提供更好的控制和可靠性，支持HTML郵件、附件和SMTP認證。 3)確保正確配置SMTP設置並使用加密（如STARTTLS或SSL/TLS）以增強安全性。 4)對於大量郵件，考慮使用郵件隊列系統來優化性能。

高級PHP電子郵件：自定義標題和功能May 09, 2025 am 12:13 AM

CustomHeadersheadersandAdvancedFeaturesInphpeMailenHanceFunctionalityAndreliability.1）CustomHeadersheadersheadersaddmetadatatatatataatafortrackingandCategorization.2）htmlemailsallowformattingandttinganditive.3）attachmentscanmentscanmentscanbesmentscanbestmentscanbesentscanbesentingslibrarieslibrarieslibrariesliblarikelikephpmailer.4）smtppapapairatienticationaltication enterticationallimpr

使用PHP和SMTP發送電子郵件的指南May 09, 2025 am 12:06 AM

使用PHP和SMTP發送郵件可以通過PHPMailer庫實現。 1)安裝並配置PHPMailer，2)設置SMTP服務器細節，3)定義郵件內容，4)發送郵件並處理錯誤。使用此方法可以確保郵件的可靠性和安全性。

使用PHP發送電子郵件的最佳方法是什麼？May 08, 2025 am 12:21 AM

ThebestapproachforsendingemailsinPHPisusingthePHPMailerlibraryduetoitsreliability,featurerichness,andeaseofuse.PHPMailersupportsSMTP,providesdetailederrorhandling,allowssendingHTMLandplaintextemails,supportsattachments,andenhancessecurity.Foroptimalu

PHP中依賴注入的最佳實踐May 08, 2025 am 12:21 AM

使用依賴注入(DI)的原因是它促進了代碼的松耦合、可測試性和可維護性。 1)使用構造函數注入依賴，2)避免使用服務定位器，3)利用依賴注入容器管理依賴，4)通過注入依賴提高測試性，5)避免過度注入依賴，6)考慮DI對性能的影響。

PHP性能調整技巧和技巧May 08, 2025 am 12:20 AM

phpperformancetuningiscialbecapeitenhancesspeedandeffice，whatevitalforwebapplications.1）cachingwithapcureduccureducesdatabaseloadprovesrovessetimes.2）優化

See all articles