搜索
首页后端开发php教程Scraping Links With PHP

Scraping Links With PHP

by justin on August 11, 2007

FROM:http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content

 

In this tutorial you will learn how to build a PHP script that scrapes links from any web page.

What You’ll Learn How to use cURL to get the content from a website (URL). Call PHP DOM functions to parse the HTML so you can extract links. Use XPath to grab links from specific parts of a page. Store the scraped links in a MySQL database. Put it all together into a link scraper. What else you could use a scraper for. Legal issues associated with scraping content. What You Will Need Basic knowledge of PHP and MySQL. A web server running PHP 5. The cURL extension for PHP. MySQL ? if you want to store the links. Get The Page Content

cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:

$ch = curl_init();curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);curl_setopt($ch, CURLOPT_URL,$target_url);curl_setopt($ch, CURLOPT_FAILONERROR, true);curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);curl_setopt($ch, CURLOPT_AUTOREFERER, true);curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);curl_setopt($ch, CURLOPT_TIMEOUT, 10);$html = curl_exec($ch);if (!$html) {echo "<br />cURL error number:" .curl_errno($ch);echo "<br />cURL error:" . curl_error($ch);exit;}

If the request is successful $html will be filled with the content of $target_url. If the call fails then we’ll see an error message about the failure.

curl_setopt($ch, CURLOPT_URL,$target_url);

This line determines what URL will be requested. For example if you wanted to scrape this site you’d have $target_url = “/makebeta/”. I won’t go into the rest of the options that are set (except for CURLOPT_USERAGENT ? see below). You can read an in depth tutorial on PHP and cURL here.

Tip: Fake Your User Agent

Many websites won’t play nice with you if you come knocking with the wrong User Agent string. What’s a User Agent string? It’s part of every request to a web server that tells it what type of agent (browser, spider, etc) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

This would set cURL’s user agent to mimic Google’s. You can find a comprehensive list of user agents here: User Agents.

Common User Agents

I’ve done a bit of the leg work for you and gathered the most common user agents:

Search Engine User Agents Google ? Googlebot/2.1 ( http://www.googlebot.com/bot.html) Google Image ? Googlebot-Image/1.0 ( http://www.googlebot.com/bot.html) MSN Live ? msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm) Yahoo ? Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) ask Browser User Agents Firefox (WindowsXP) ? Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 IE 7 ? Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30) IE 6 ? Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322) Safari ? Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2 Opera ? Opera/9.00 (Windows NT 5.1; U; en) Using PHP’s DOM Functions To Parse The HTML

PHP provides with a really cool tool for working with HTML content: DOM Functions. The DOM Functions allow you to parse HTML (or XML) into an object structure (or DOM ? Document Object Model). Let’s see how we do it:

$dom = new DOMDocument();@$dom->loadHTML($html);

Wow is it really that easy? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice clean way. I discovered this over at Russll Beattie’s post on: Using PHP TO Scrape Sites As Feeds, thanks Russell!

Tip: You may have noticed I put @ in front of loadHTML(), this suppresses some annoying warnings that the HTML parser throws on many pages that have non-standard compliant code.

XPath Makes Getting The Links You Want Easy

Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (otherwise known as tags in HTML). Say you want to only get links that are within unordered lists. All you have to do is write a query like “/html/body//ul//li//a” and pass it to XPath->evaluate(). I’m not going to go into all the ways you can use XPath because I’m just learning myself and someone else has already made a great list of examples: XPath Examples. Here’s a code snippet that will just get every link on the page using XPath:

$xpath = new DOMXPath($dom);$hrefs = $xpath->evaluate("/html/body//a");
Iterate And Store Your Links

Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:

for ($i = 0; $i < $hrefs->length; $i++) {$href = $hrefs->item($i);$url = $href->getAttribute('href');storeLink($url,$target_url);}
 
 
FULL PROGRAM:

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);curl_setopt($ch, CURLOPT_URL,$target_url);curl_setopt($ch, CURLOPT_FAILONERROR, true);curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);curl_setopt($ch, CURLOPT_AUTOREFERER, true);curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);curl_setopt($ch, CURLOPT_TIMEOUT, 10);$html = curl_exec($ch);if (!$html) { echo "
cURL error number:" .curl_errno($ch); echo "
cURL error:" . curl_error($ch); exit;}$dom = new DOMDocument();@$dom->loadHTML($html);$xpath = new DOMXPath($dom);$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) { $href = $hrefs->item($i); $url = $href->getAttribute('href'); echo $url; echo "
"; }

?>
then you can store url to your database. more details from here:http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content
   
REF:tutorial on PHP and cURL 
You can find a comprehensive list of user agents here: User Agents.
Using PHP TO Scrape Sites As Feeds
   
 
声明
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系admin@php.cn
11个最佳PHP URL缩短脚本(免费和高级)11个最佳PHP URL缩短脚本(免费和高级)Mar 03, 2025 am 10:49 AM

长URL(通常用关键字和跟踪参数都混乱)可以阻止访问者。 URL缩短脚本提供了解决方案,创建了简洁的链接,非常适合社交媒体和其他平台。 这些脚本对于单个网站很有价值

Instagram API简介Instagram API简介Mar 02, 2025 am 09:32 AM

在Facebook在2012年通过Facebook备受瞩目的收购之后,Instagram采用了两套API供第三方使用。这些是Instagram Graph API和Instagram Basic Display API。作为开发人员建立一个需要信息的应用程序

在Laravel中使用Flash会话数据在Laravel中使用Flash会话数据Mar 12, 2025 pm 05:08 PM

Laravel使用其直观的闪存方法简化了处理临时会话数据。这非常适合在您的应用程序中显示简短的消息,警报或通知。 默认情况下,数据仅针对后续请求: $请求 -

简化的HTTP响应在Laravel测试中模拟了简化的HTTP响应在Laravel测试中模拟了Mar 12, 2025 pm 05:09 PM

Laravel 提供简洁的 HTTP 响应模拟语法,简化了 HTTP 交互测试。这种方法显着减少了代码冗余,同时使您的测试模拟更直观。 基本实现提供了多种响应类型快捷方式: use Illuminate\Support\Facades\Http; Http::fake([ 'google.com' => 'Hello World', 'github.com' => ['foo' => 'bar'], 'forge.laravel.com' =>

构建具有Laravel后端的React应用程序:第2部分,React构建具有Laravel后端的React应用程序:第2部分,ReactMar 04, 2025 am 09:33 AM

这是有关用Laravel后端构建React应用程序的系列的第二个也是最后一部分。在该系列的第一部分中,我们使用Laravel为基本的产品上市应用程序创建了一个RESTFUL API。在本教程中,我们将成为开发人员

php中的卷曲:如何在REST API中使用PHP卷曲扩展php中的卷曲:如何在REST API中使用PHP卷曲扩展Mar 14, 2025 am 11:42 AM

PHP客户端URL(curl)扩展是开发人员的强大工具,可以与远程服务器和REST API无缝交互。通过利用Libcurl(备受尊敬的多协议文件传输库),PHP curl促进了有效的执行

在Codecanyon上的12个最佳PHP聊天脚本在Codecanyon上的12个最佳PHP聊天脚本Mar 13, 2025 pm 12:08 PM

您是否想为客户最紧迫的问题提供实时的即时解决方案? 实时聊天使您可以与客户进行实时对话,并立即解决他们的问题。它允许您为您的自定义提供更快的服务

宣布 2025 年 PHP 形势调查宣布 2025 年 PHP 形势调查Mar 03, 2025 pm 04:20 PM

2025年的PHP景观调查调查了当前的PHP发展趋势。 它探讨了框架用法,部署方法和挑战,旨在为开发人员和企业提供见解。 该调查预计现代PHP Versio的增长

See all articles

热AI工具

Undresser.AI Undress

Undresser.AI Undress

人工智能驱动的应用程序,用于创建逼真的裸体照片

AI Clothes Remover

AI Clothes Remover

用于从照片中去除衣服的在线人工智能工具。

Undress AI Tool

Undress AI Tool

免费脱衣服图片

Clothoff.io

Clothoff.io

AI脱衣机

AI Hentai Generator

AI Hentai Generator

免费生成ai无尽的。

热门文章

R.E.P.O.能量晶体解释及其做什么(黄色晶体)
2 周前By尊渡假赌尊渡假赌尊渡假赌
仓库:如何复兴队友
4 周前By尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island冒险:如何获得巨型种子
3 周前By尊渡假赌尊渡假赌尊渡假赌

热工具

SublimeText3 Mac版

SublimeText3 Mac版

神级代码编辑软件(SublimeText3)

SublimeText3 Linux新版

SublimeText3 Linux新版

SublimeText3 Linux最新版

SecLists

SecLists

SecLists是最终安全测试人员的伙伴。它是一个包含各种类型列表的集合,这些列表在安全评估过程中经常使用,都在一个地方。SecLists通过方便地提供安全测试人员可能需要的所有列表,帮助提高安全测试的效率和生产力。列表类型包括用户名、密码、URL、模糊测试有效载荷、敏感数据模式、Web shell等等。测试人员只需将此存储库拉到新的测试机上,他就可以访问到所需的每种类型的列表。

WebStorm Mac版

WebStorm Mac版

好用的JavaScript开发工具

SublimeText3 英文版

SublimeText3 英文版

推荐:为Win版本,支持代码提示!