Scraping Links With PHP
by justin on August 11, 2007
FROM:http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content
In this tutorial you will learn how to build a PHP script that scrapes links from any web page.
What You’ll Learn
- How to use cURL to get the content from a website (URL).
- Call PHP DOM functions to parse the HTML so you can extract links.
- Use XPath to grab links from specific parts of a page.
- Store the scraped links in a MySQL database.
- Put it all together into a link scraper.
- What else you could use a scraper for.
- Legal issues associated with scraping content.

What You Will Need
- Basic knowledge of PHP and MySQL.
- A web server running PHP 5.
- The cURL extension for PHP.
- MySQL, if you want to store the links.

Get The Page Content
cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
If the request is successful $html will be filled with the content of $target_url. If the call fails then we’ll see an error message about the failure.
curl_setopt($ch, CURLOPT_URL,$target_url);
This line determines which URL will be requested. For example, if you wanted to scrape this site you’d have $target_url = "http://www.merchantos.com/makebeta/". I won’t go into the rest of the options that are set (except for CURLOPT_USERAGENT, see below). For more, see the in-depth tutorial on PHP and cURL listed under REF at the end.
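For readers who want a quick gloss on the remaining options, here is a minimal sketch that sets the same options in one call and annotates each of them. The function name fetchHtml() is hypothetical (it is not part of the original tutorial), and the scalar type declarations assume PHP 7 or later. It also checks for a strict false from curl_exec(), so an empty but successful response is not reported as an error.

```php
<?php
// Hypothetical helper, not from the original tutorial: fetch a page and
// return its body, throwing an exception instead of echoing on failure.
function fetchHtml(string $targetUrl, string $userAgent): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $targetUrl, // the page to request
        CURLOPT_USERAGENT      => $userAgent, // User-Agent header to send
        CURLOPT_FAILONERROR    => true,       // treat HTTP status codes >= 400 as failures
        CURLOPT_FOLLOWLOCATION => true,       // follow Location: redirects
        CURLOPT_AUTOREFERER    => true,       // set the Referer header when following a redirect
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_TIMEOUT        => 10,         // give up after 10 seconds
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        // Same diagnostics the tutorial prints, packaged as an exception.
        throw new RuntimeException(
            sprintf('cURL error %d: %s', curl_errno($ch), curl_error($ch))
        );
    }

    return $html;
}
```

A caller would wrap fetchHtml($target_url, $userAgent) in try/catch and decide there whether to log the failure, retry, or stop.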
Tip: Fake Your User Agent
Many websites won’t play nice with you if you come knocking with the wrong User Agent string. What’s a User Agent string? It’s part of every request to a web server that tells it what type of agent (browser, spider, etc.) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
This would set cURL’s user agent to mimic Google’s. You can find a comprehensive list of user agents here: User Agents.
Common User Agents
I’ve done a bit of the legwork for you and gathered the most common user agents:
Search Engine User Agents
- Google: Googlebot/2.1 (http://www.googlebot.com/bot.html)
- Google Image: Googlebot-Image/1.0 (http://www.googlebot.com/bot.html)
- MSN Live: msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)
- Yahoo: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Browser User Agents
- Firefox (Windows XP): Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
- IE 7: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
- IE 6: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
- Safari: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2
- Opera: Opera/9.00 (Windows NT 5.1; U; en)

Using PHP’s DOM Functions To Parse The HTML
PHP provides a really cool tool for working with HTML content: the DOM functions. They allow you to parse HTML (or XML) into an object structure (a DOM, or Document Object Model). Let’s see how we do it:
$dom = new DOMDocument();
@$dom->loadHTML($html);
Wow, is it really that easy? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice clean way. I discovered this in Russell Beattie’s post, Using PHP To Scrape Sites As Feeds. Thanks, Russell!
Tip: You may have noticed I put @ in front of $dom->loadHTML(). This suppresses the annoying warnings that the HTML parser throws on the many pages whose markup is not standards-compliant.
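The @ operator hides every error raised by that call, not just the markup warnings. One alternative, sketched here under the assumption that you may want to inspect the parser's complaints later, is libxml_use_internal_errors(), which buffers them instead of emitting PHP warnings. The sample markup and variable names are made up for illustration.

```php
<?php
// Sample markup with an unclosed <p> and an unknown <foo> tag, the kind of
// sloppy input real pages often contain.
$html = '<html><body><p>Links:<a href="/one">One</a><foo><a href="/two">Two</a></body></html>';

// Buffer libxml parser errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Grab whatever the parser reported, then clear the buffer and restore
// the previous error-handling mode.
$parseErrors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable even though the markup was imperfect.
$links = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

print_r($links);
echo count($parseErrors) . " parser message(s) collected\n";
```

How many messages are collected depends on the libxml version installed, so treat $parseErrors as diagnostic output rather than something to rely on.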
XPath Makes Getting The Links You Want Easy
Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (the elements that HTML tags turn into). Say you want to only get links that are within unordered lists. All you have to do is write a query like "/html/body//ul//li//a" and pass it to the evaluate() method of a DOMXPath object. I’m not going to go into all the ways you can use XPath because I’m just learning myself and someone else has already made a great list of examples: XPath Examples. Here’s a code snippet that will just get every link on the page using XPath:
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

Iterate And Store Your Links
Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
}
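To see how the choice of XPath query changes what the loop receives, the sketch below runs the list-only query from the XPath section against a small inline document, then a second query that selects the href attributes directly. The sample HTML and variable names are invented for illustration. It uses foreach, which works on the DOMNodeList that evaluate() returns for these queries.

```php
<?php
$html = <<<HTML
<html><body>
  <p><a href="/intro">Intro</a></p>
  <ul>
    <li><a href="/first">First</a></li>
    <li><span><a href="/second">Second</a></span></li>
  </ul>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Only anchors nested somewhere inside an <li> of a <ul>.
$listLinks = [];
foreach ($xpath->evaluate('/html/body//ul//li//a') as $anchor) {
    $listLinks[] = $anchor->getAttribute('href');
}

// Variation: select the href attribute nodes themselves, which skips
// any <a> element that has no href at all.
$allHrefs = [];
foreach ($xpath->evaluate('//a/@href') as $attr) {
    $allHrefs[] = $attr->nodeValue;
}

print_r($listLinks); // /first and /second
print_r($allHrefs);  // /intro, /first and /second
```

The first query leaves out the link in the paragraph because it is not inside a list, which is the kind of filtering the XPath section describes.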
FULL PROGRAM:
<?php
// Example target; replace it with the page you want to scrape.
$target_url = "http://www.merchantos.com/makebeta/";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// Make the cURL request to $target_url.
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// Parse the HTML; @ hides the warnings raised by malformed markup.
$dom = new DOMDocument();
@$dom->loadHTML($html);

// Grab every link on the page and print its href.
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url;
    echo "<br />";
}
?>
You can then store each URL in your database. More details are available here: http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content
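The storeLink() function called in the iteration loop is not shown in this copy of the tutorial, so what follows is only one possible sketch of it, not the author's implementation. It assumes PDO with a prepared statement, a table named links with url and source_url columns (both names are made up), and PHP 7.1 or later. It also takes the database handle as an extra first argument, unlike the two-argument call in the loop. An in-memory SQLite database stands in for MySQL so the example runs without a server.

```php
<?php
// Hypothetical storeLink(): the table and column names are assumptions.
function storeLink(PDO $db, string $url, string $sourceUrl): void
{
    // A prepared statement keeps scraped URLs from being interpreted as SQL.
    $stmt = $db->prepare(
        'INSERT INTO links (url, source_url) VALUES (:url, :source_url)'
    );
    $stmt->execute([':url' => $url, ':source_url' => $sourceUrl]);
}

// SQLite in memory keeps the sketch self-contained. For MySQL you would
// pass a DSN such as 'mysql:host=localhost;dbname=scraper' plus a user
// name and password (all placeholder values) and use AUTO_INCREMENT
// for the id column.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec(
    'CREATE TABLE links (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL,
        source_url TEXT NOT NULL
    )'
);

// Store two sample links as if they had been scraped from one page.
storeLink($db, '/makebeta/', 'http://www.merchantos.com/');
storeLink($db, '/about/', 'http://www.merchantos.com/');

$count = (int) $db->query('SELECT COUNT(*) FROM links')->fetchColumn();
echo "Stored $count links\n";
```

Recording the source page next to each link makes it possible to tell later which page a URL came from, which is presumably why the loop passes $target_url along with $url.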
REF:
Tutorial on PHP and cURL
User Agents (a comprehensive list of user agents)
Using PHP To Scrape Sites As Feeds
