
How do I build a simple PHP crawler to extract links and content from a website?

Linda Hamilton · Original · 2024-11-07


Creating a Simple PHP Crawler

Crawling websites and extracting data is a common task in web programming. PHP's built-in DOM extension, together with its HTTP stream wrappers, lets you fetch remote pages and pull structured information out of their HTML.

To create a simple PHP crawler that collects links and content from a given web page, you can utilize the following approach:

Using a DOM Parser:

<?php
function crawl_page($url, $depth = 5)
{
    // Prevent endless recursion and circular references
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    // Mark the URL as seen
    $seen[$url] = true;

    // Load the web page into a DOM tree; the @ suppresses the parser
    // warnings that real-world (non-valid) HTML inevitably triggers
    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    // Iterate over all anchor tags (<a>)
    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');

        // Convert relative URLs to absolute URLs
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                if (isset($parts['path'])) {
                    $href .= dirname($parts['path']);
                }
                $href .= $path;
            }
        }

        // Recursively crawl the linked page
        crawl_page($href, $depth - 1);
    }

    // Output the crawled page's URL and content
    echo "URL: " . $url . PHP_EOL . "CONTENT: " . PHP_EOL . $dom->saveHTML() . PHP_EOL . PHP_EOL;
}
crawl_page("http://example.com", 2);
?>

This crawler uses PHP's DOM parser to walk the page's HTML, finds every anchor tag, and recursively follows the links they contain, up to the given depth. Each crawled page's URL and HTML content are written to standard output; redirect that output to a text file to save the collected data locally.
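If you only need the link-extraction step in isolation (for example, to test it without hitting the network), it can be factored into a small function that takes an HTML string. This is a minimal sketch using DOMXPath; the function name `extract_links` is an assumption, not part of the code above:

```php
<?php
// Return every href attribute found in anchor tags of an HTML string.
function extract_links(string $html): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);            // suppress warnings from non-valid HTML
    $xpath = new DOMXPath($dom);

    $links = [];
    // The XPath //a[@href] matches only anchors that actually have an href
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}
```

Separating fetching from parsing like this also makes it easy to swap the fetch step for cURL later if you need custom headers or redirect handling.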

Additional Features:

  • Prevents crawling the same URL multiple times.
  • Handles relative URLs correctly.
  • Supports HTTPS, user credentials, and port numbers, using http_build_url() from the http PECL extension when available and a manual parse_url()-based fallback otherwise.
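The relative-URL handling in the fallback branch can be expressed as a standalone helper, which is easier to reason about and to test. The following is a hypothetical sketch built on parse_url(); it distinguishes root-relative links (starting with /) from path-relative ones, which the inline fallback above does not:

```php
<?php
// Resolve a possibly-relative href against a base URL (simplified:
// ignores query strings, fragments, and "../" segments).
function resolve_url(string $base, string $href): string
{
    if (strpos($href, 'http') === 0) {
        return $href;                              // already absolute
    }
    $parts = parse_url($base);
    $abs = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $abs .= ':' . $parts['port'];
    }
    if ($href !== '' && $href[0] === '/') {
        return $abs . $href;                       // root-relative
    }
    // Path-relative: resolve against the base URL's directory
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
    return $abs . $dir . '/' . $href;
}
```

A production crawler would also need to handle "../" segments, query strings, and protocol-relative URLs ("//host/path"), which are omitted here for brevity.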

