How to Build a PHP Web Crawler to Gather Data from Multiple Links?

Susan Sarandon
2024-11-08 06:50:02

PHP Web Crawler: Harvesting Data from Multiple Links

Question:

Create a PHP script to retrieve data from multiple links on a web page and store it in a local file.

Answer:

Using DOM and Depth Control:

<?php
// Recursively crawl $url, following links up to $depth levels deep
function crawl_page($url, $depth = 5)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    // Suppress warnings from malformed HTML; stop if the page cannot be loaded
    if (!@$dom->loadHTMLFile($url)) {
        return;
    }

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
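        // Skip empty links, in-page fragments, and non-HTTP schemes
        if ($href === '' || $href[0] === '#' || preg_match('/^(mailto|javascript|tel):/i', $href)) {
            continue;
        }
        // Handle relative URLs
        if (0 !== strpos($href, 'http')) {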
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
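                // The path may be absent (e.g. "http://hobodave.com"); also avoid a doubled slash at the root
                $href .= rtrim(dirname($parts['path'] ?? '/', 1), '/\\') . $path;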
            }
        }
        crawl_page($href, $depth - 1);
    }

    // Output data
    echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
}

// Usage
crawl_page("http://hobodave.com", 2);
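The relative-URL branch in the function above appends every non-absolute link to the directory of the current page, so a root-relative link such as /about is also treated as directory-relative. A minimal sketch of a stricter resolver follows, assuming a hypothetical helper named resolve_url() that is not part of the original answer. It distinguishes absolute, protocol-relative, root-relative, and document-relative links, but it does not normalize "./" or "../" segments.

```php
<?php
// Hypothetical helper (not from the original answer): resolve $href against $base.
// Returns null for links that should not be crawled.
function resolve_url(string $base, string $href): ?string
{
    // Ignore empty links, in-page fragments, and non-HTTP schemes
    if ($href === '' || $href[0] === '#' || preg_match('/^(mailto|javascript|tel|data):/i', $href)) {
        return null;
    }
    // Already absolute: use as-is
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    // Protocol-relative link (//host/path): reuse the scheme of the base URL
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    // Root-relative link (/path): resolve against the site root
    if ($href[0] === '/') {
        return $root . $href;
    }
    // Document-relative link: resolve against the directory of the base path
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // everything up to the last slash
    return $root . $dir . $href;
}
```

Inside crawl_page(), the result could replace the manual string building: call resolve_url($url, $href) and skip the link when it returns null.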

Notes:

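  • This version parses pages with DOMDocument, which copes with malformed markup better than regular expressions do.
  • It converts relative links to absolute URLs before following them, using http_build_url() when the PECL http extension is loaded and parse_url() otherwise.
  • The $depth parameter limits how far the crawler recurses, and the static $seen array ensures each URL is visited only once; together they prevent infinite loops.
  • The output is echoed to STDOUT, so you can redirect it to a file from the shell.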
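The question asks for the data to be stored in a local file, and the answer leaves that to shell redirection. If writing from PHP itself is preferred, one possible approach is sketched below. The helper name save_page() and the file name used in the usage note are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): append one crawled page to a local file.
// The record layout mirrors the echo statement in crawl_page().
function save_page(string $file, string $url, string $html): bool
{
    $record = 'URL:' . $url . PHP_EOL
            . 'CONTENT:' . PHP_EOL
            . $html . PHP_EOL . PHP_EOL;

    // FILE_APPEND adds to the file instead of overwriting it;
    // LOCK_EX takes an exclusive lock while writing.
    return file_put_contents($file, $record, FILE_APPEND | LOCK_EX) !== false;
}
```

To use it, the echo line in crawl_page() could be swapped for something like save_page('crawl_output.txt', $url, $dom->saveHTML()). Appending keeps the results of earlier pages, so the file should be deleted or truncated before a fresh crawl.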
