How can I build a robust PHP crawler using DOM manipulation for extracting data from web pages with multiple links?

Mary-Kate Olsen
2024-11-08 07:11:01

Crawling with PHP: A Comprehensive Guide

PHP offers several ways to extract data from a web page and follow the links it contains. Regular expressions are one option, but they break easily on real-world HTML, so a DOM parser is the safer choice for anything beyond trivial matching.
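As a minimal sketch of the DOM approach, the snippet below collects the href values from an HTML string using PHP's DOMDocument. The helper name extract_links and the sample markup are illustrative assumptions, not part of the original crawler.

```php
<?php
// Hypothetical helper: return the href value of every <a> element in $html.
function extract_links(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // Skip anchors without an href, such as named anchors.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Sample markup (an assumption for this example).
$html = '<p><a href="/about">About</a> <a name="top">Top</a> '
      . '<a href="https://example.com/">Home</a></p>';

print_r(extract_links($html));
```

The parser tolerates unclosed tags and odd attribute quoting, which is where a hand-written regular expression tends to fail.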

DOM-Based Crawler Implementation

Tatu's DOM-based crawler provides a reliable alternative. Here's an improved version:

function crawl_page($url, $depth = 5)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    // @ silences the warnings raised for malformed HTML or a failed fetch.
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        // Relative link: rebuild an absolute URL from the current page's URL.
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            // http_build_url() ships only with pecl_http 1.x, so test for the function itself.
            if (function_exists('http_build_url')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
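                // Default to '/' when the URL has no path, and avoid a double slash at the root.
                $href .= rtrim(dirname($parts['path'] ?? '/', 1), '/\\') . $path;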
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
}

This improved version handles URLs that use https or that include a user, password, or port when it rebuilds relative links into absolute ones.
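The components mentioned here come from PHP's parse_url(), which splits a URL into an associative array. The sketch below prints that array and rebuilds the part of the URL that precedes the path. The helper name url_base and the sample URL are assumptions for illustration. Unlike the crawler above, this variant also keeps a user name that has no password.

```php
<?php
// Hypothetical helper: rebuild "scheme://user:pass@host:port" from an absolute URL.
function url_base(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    $base = $parts['scheme'] . '://';
    if (isset($parts['user'])) {
        $base .= $parts['user'];
        if (isset($parts['pass'])) {
            $base .= ':' . $parts['pass'];
        }
        $base .= '@';
    }
    $base .= $parts['host'];
    if (isset($parts['port'])) {
        // parse_url() returns the port as an integer.
        $base .= ':' . $parts['port'];
    }
    return $base;
}

// Sample URL (an assumption for this example).
$url = 'https://user:secret@example.com:8080/docs/index.html';

print_r(parse_url($url));      // keys: scheme, host, port, user, pass, path
echo url_base($url), PHP_EOL;  // https://user:secret@example.com:8080
```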

Enhancements

George pointed out a bug in the original version: it appended relative URLs to the end of the current URL's path instead of overwriting it. That has been fixed here, so relative URLs now resolve as expected.
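To make the difference concrete, here is a rough sketch of resolving a relative link by replacing the last segment of the current path rather than appending to it. The helper name resolve_path is hypothetical, and the sketch deliberately ignores "../" segments, query strings, and fragments, which a complete resolver would have to handle.

```php
<?php
// Hypothetical helper: resolve $relative against the path of the current page.
function resolve_path(string $basePath, string $relative): string
{
    // A root-relative link replaces the whole path.
    if ($relative !== '' && $relative[0] === '/') {
        return $relative;
    }

    // Otherwise keep the directory of the current page and drop its last segment.
    $slash = strrpos($basePath, '/');
    $dir = $slash === false ? '/' : substr($basePath, 0, $slash + 1);

    return $dir . $relative;
}

// Appending (the reported bug) would yield /docs/index.html/page2.html.
echo resolve_path('/docs/index.html', 'page2.html'), PHP_EOL; // /docs/page2.html
echo resolve_path('/docs/index.html', '/about'), PHP_EOL;     // /about
```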

Saving Output

The modified version of the crawler echoes its output to STDOUT, allowing you to conveniently redirect it to a file of your choice.
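On the command line, a shell redirect such as php crawler.php > output.txt is enough (the script name is an assumption). If the output should be saved from inside PHP instead, one possible approach is to capture the echoed text with output buffering. The helper capture_to_file and the stand-in closure below are hypothetical. In practice the closure would call crawl_page().

```php
<?php
// Hypothetical helper: run a callable that echoes, and write what it printed to $file.
function capture_to_file(callable $producer, string $file): int
{
    ob_start();
    try {
        $producer();
    } finally {
        // Always close the buffer, even if the callable throws.
        $output = ob_get_clean();
    }

    if (file_put_contents($file, $output) === false) {
        throw new RuntimeException("Could not write $file");
    }
    return strlen($output);
}

// Target file in the system temp directory (an assumption for this example).
$file = sys_get_temp_dir() . '/crawl-output.txt';

$bytes = capture_to_file(function () {
    // Stand-in for: crawl_page('http://example.com/', 2);
    echo "URL:", "http://example.com/", PHP_EOL;
}, $file);

echo "Wrote $bytes bytes to $file", PHP_EOL;
```

Buffering holds the whole crawl in memory before writing, so for large crawls the shell redirect is likely the better fit.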

By incorporating these enhancements, this DOM-based crawler provides a robust solution for extracting data from web pages with multiple links in PHP.
