Creating a Simple PHP Crawler
Crawling websites and extracting data is a common task in web programming. PHP ships with tools that suit the job, such as the DOM extension (DOMDocument), which can load and parse remote web pages when allow_url_fopen is enabled.
To create a simple PHP crawler that collects links and content from a given web page, you can utilize the following approach:
Using a DOM Parser:
<?php
function crawl_page($url, $depth = 5)
{
    // Prevent endless recursion and circular references
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    // Mark the URL as seen
    $seen[$url] = true;

    // Load the web page using DOM; @ suppresses warnings from malformed HTML
    $dom = new DOMDocument('1.0');
    if (!@$dom->loadHTMLFile($url)) {
        return; // The page could not be fetched or parsed
    }

    // Iterate over all anchor tags (<a>)
    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');

        // Skip empty links, fragments, and non-HTTP schemes such as mailto:
        if ($href === '' || $href[0] === '#' || preg_match('/^(?!https?:)[a-z][a-z0-9+.-]*:/i', $href)) {
            continue;
        }

        // Convert relative URLs to absolute URLs
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            // http_build_url() only exists in pecl_http 1.x, so test for the function itself
            if (function_exists('http_build_url')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                // The path may be missing (e.g. "http://example.com")
                $base = isset($parts['path']) ? dirname($parts['path'], 1) : '';
                // Trim the trailing separator so the result has no double slash
                $href .= rtrim($base, '/\\') . $path;
            }
        }

        // Recursively crawl the linked page
        crawl_page($href, $depth - 1);
    }

    // Output the crawled page's URL and content
    echo "URL: " . $url . PHP_EOL . "CONTENT: " . PHP_EOL . $dom->saveHTML() . PHP_EOL . PHP_EOL;
}

crawl_page("http://example.com", 2);
?>
This crawler uses a DOM parser to walk the page's HTML, find every anchor tag, and follow the links they contain. It recurses up to the given depth (2 in the call above), skips URLs it has already visited, and prints each page's URL and HTML to standard output. To save the collected data locally, redirect that output to a text file when running the script from the command line.
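The crawler above prints raw HTML. If you would rather get the links and text back as data, the same DOM calls can be wrapped in a small function. The following is a minimal sketch, not part of the original crawler: the function name extract_links_and_text and the sample markup are assumptions made for illustration, and it parses an HTML string so it can be tried without any network access.

```php
<?php
/**
 * Hypothetical helper (an illustration, not part of the original crawler):
 * parse an HTML string and return its link targets and text content.
 */
function extract_links_and_text(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string on PHP 8, so bail out early
    if (trim($html) === '') {
        return array('links' => array(), 'text' => '');
    }

    $dom = new DOMDocument('1.0');

    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Gather the href attribute of every <a> element that has one
    $links = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    // textContent concatenates every text node under the root element
    // (including script and style bodies); collapse runs of whitespace
    $root = $dom->documentElement;
    $text = $root ? trim(preg_replace('/\s+/', ' ', $root->textContent)) : '';

    return array(
        'links' => array_values(array_unique($links)),
        'text'  => $text,
    );
}

// Sample markup, made up for this example
$sample = '<html><body><h1>Hello</h1>'
        . '<p>See <a href="/about">About</a> and <a href="http://example.com/">home</a>.</p>'
        . '</body></html>';

$result = extract_links_and_text($sample);
print_r($result['links']);
echo $result['text'] . PHP_EOL;
```

Returning an array instead of echoing leaves the caller free to write the data to a file, store it in a database, or resolve the relative links before queuing them for the next crawl step.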
Additional Features: