How to parse and process HTML/XML in PHP?

How to parse HTML/XML and extract information from it?

P粉466643318 (429 days ago)


  • P粉555696738

    P粉555696738 2023-10-12 19:20:54

    Try Simple HTML DOM Parser.

    • HTML DOM parser written in PHP 5 that allows you to manipulate HTML in a very easy way!
    • Requires PHP 5.
    • Supports invalid HTML.
    • Use selectors to find tags on HTML pages, just like jQuery.
    • Extract content from HTML in one line.

    Note: As the name suggests, it is useful for simple tasks. It uses regular expressions instead of an HTML parser, so it will be much slower for more complex tasks. Most of its codebase was written in 2008, with only minor improvements made since then. It does not follow modern PHP coding standards and is difficult to incorporate into modern PSR-compliant projects.

    Example:

    How to get HTML elements:

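    // Load the library (assumes simple_html_dom.php is on the include path)
    require_once 'simple_html_dom.php';

    // Create DOM from URL or file
    $html = file_get_html('http://www.example.com/');

    // Find all images and print their src attributes
    foreach ($html->find('img') as $element) {
        echo $element->src . '<br>';
    }

    // Find all links and print their href attributes
    foreach ($html->find('a') as $element) {
        echo $element->href . '<br>';
    }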

    How to modify HTML elements:

    // Create DOM from string
    $html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
    
    $html->find('div', 1)->class = 'bar';
    
    $html->find('div[id=hello]', 0)->innertext = 'foo';
    
    echo $html;

    Extract content from HTML:

    // Dump contents (without tags) from HTML
    echo file_get_html('http://www.google.com/')->plaintext;

    Grab Slashdot:

    // Create DOM from URL
    $html = file_get_html('http://slashdot.org/');

    // Find all article blocks and collect their parts
    $articles = [];
    foreach ($html->find('div.article') as $article) {
        $articles[] = [
            'title'   => $article->find('div.title', 0)->plaintext,
            'intro'   => $article->find('div.intro', 0)->plaintext,
            'details' => $article->find('div.details', 0)->plaintext,
        ];
    }

    print_r($articles);

  • P粉115840076

    P粉115840076 2023-10-12 17:24:06

    Native XML extensions

    I prefer to use one of the native XML extensions because they come bundled with PHP, are generally faster than 3rd party libraries, and give me all the control I need over the markup.

    DOM

    DOM can parse and modify real-world (broken) HTML, and it can run XPath queries. It is based on libxml.

    Working with DOM takes some time to become productive, but in my opinion, it's worth the time. Since DOM is a language-neutral interface, you'll find implementations in multiple languages, so if you need to change programming languages, you most likely already know how to use that language's DOM API.

    How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure that most of the problems you run into can be solved by searching or browsing Stack Overflow.

    A basic usage example and a general conceptual overview can be found in other answers.
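
    As a minimal sketch of the DOM approach described above (the input string is made up for illustration, with unclosed tags on purpose), this loads broken HTML through libxml's HTML parser and collects link targets with an XPath query:

```php
<?php
// Hypothetical input: unclosed <p> tags and an unclosed final <a>.
$html = '<html><body><p>First<p>Second <a href="/one">one</a> <a href="/two">two</body></html>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);   // loadHTML() uses libxml's HTML parser, which tolerates broken markup
libxml_clear_errors();

// Query every <a> element that has an href attribute.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a[@href]') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

print_r($hrefs);
```

    Suppressing libxml warnings is a choice made for this sketch; in real code you may prefer to inspect them with libxml_get_errors() before clearing.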

    XMLReader

    XMLReader, like DOM, is based on libxml. I don't know how to trigger the HTML parser module, so using XMLReader on broken HTML may be less robust than using DOM, where you can explicitly tell it to use libxml's HTML parser module.

    A basic usage example is provided in another answer.
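
    For orientation, here is a small sketch of XMLReader's pull style, assuming well-formed XML as input (the document and element names are invented for the example). The code asks for the next node itself and reacts only to the ones it cares about:

```php
<?php
// Hypothetical, well-formed XML input.
$xml = '<books><book id="1"><title>A</title></book><book id="2"><title>B</title></book></books>';

$reader = new XMLReader();
$reader->XML($xml);   // read from a string; use open() for a file or URL

$titles = [];
// Pull nodes one at a time; only the current node is held in memory.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        $titles[] = $reader->readString();   // text content of the current element
    }
}
$reader->close();

print_r($titles);
```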

    XML Parser

    The XML Parser library is also based on libxml and implements a SAX-style XML push parser. It's probably a better choice than DOM or SimpleXML for memory management, but harder to use than the pull parser implemented by XMLReader.
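
    To illustrate what "push" means here, the following sketch registers callbacks and lets the parser call them as it encounters elements and text. The XML string and the variable names are assumptions made for the example:

```php
<?php
// Hypothetical, well-formed XML input.
$xml = '<books><book><title>A</title></book><book><title>B</title></book></books>';

$titles  = [];
$inTitle = false;

$parser = xml_parser_create();
// Keep element names as written; by default they are folded to upper case.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// The parser pushes start-tag and end-tag events into these callbacks.
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$inTitle, &$titles) {
        if ($name === 'title') {
            $inTitle  = true;
            $titles[] = '';   // start a new title buffer
        }
    },
    function ($parser, $name) use (&$inTitle) {
        if ($name === 'title') {
            $inTitle = false;
        }
    }
);

// Text may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inTitle, &$titles) {
        if ($inTitle) {
            $titles[count($titles) - 1] .= $data;
        }
    }
);

if (xml_parse($parser, $xml, true) !== 1) {
    echo xml_error_string(xml_get_error_code($parser)), "\n";
}

print_r($titles);
```

    The state flag ($inTitle) is the part that makes push parsers harder to use: your code has to track where it is in the document between callbacks.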

    SimpleXML

    SimpleXML is an option when you know that the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML, because it will choke on it.

    Basic usage examples are available, and there are many more examples in the PHP manual.
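
    A short sketch of the SimpleXML case, assuming the input really is well-formed XHTML (the markup below is invented for illustration):

```php
<?php
// Hypothetical, well-formed XHTML fragment.
$xhtml = '<html><body><ul>'
       . '<li><a href="/one">One</a></li>'
       . '<li><a href="/two">Two</a></li>'
       . '</ul></body></html>';

$doc = simplexml_load_string($xhtml);
if ($doc === false) {
    // This is what happens with markup that is not well-formed.
    exit("Input is not well-formed XML\n");
}

// Elements are object properties, attributes are array keys.
$links = [];
foreach ($doc->body->ul->li as $li) {
    $links[(string) $li->a['href']] = (string) $li->a;
}

print_r($links);
```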

    3rd party library (based on libxml)

    If you prefer to use a 3rd party library, I recommend one that actually uses DOM/libxml underneath instead of string parsing.

    FluentDom

    HtmlPageDom

    phpQuery

    This is described as "abandonware and buggy: use at your own risk" but appears to be minimally maintained.

    laminas-dom

    fDOMDocument

    sabre/xml

    FluidXML


    3rd party (not based on libxml)

    The benefit of building on top of DOM/libxml is that you get good performance out of the box because you're building on native extensions. However, not all third-party libraries go this route. Some of them are listed below.

    PHP Simple HTML DOM Parser

    I generally do not recommend this parser. The code base is terrible and the parser itself is quite slow and memory intensive. Not all jQuery selectors (such as child selectors) are supported. Any libxml-based library should easily outperform it.

    PHP Html parser

    Again, I would not recommend this parser. It is rather slow and has high CPU usage. There is also no function to clear the memory of created DOM objects. These problems are especially severe in nested loops. The documentation itself is inaccurate and contains misspellings, and requests for fixes have gone unanswered since April 14, 2016.


    HTML 5

    You can use the above to parse HTML5, but there can be quirks because of the markup HTML5 allows. For HTML5 you may therefore want to consider a dedicated parser. Note that these are written in PHP, so they will be slower and use more memory than an extension compiled from a lower-level language.

    HTML5DomDocument

    HTML5


    Regular expression

    Last and least recommended, you can use regular expressions to extract data from HTML. In general, the use of regular expressions on HTML is discouraged.

    Most of the code snippets you find on the web for matching tags are fragile. In most cases, they only work with very specific snippets of HTML. Small markup changes (such as adding a space somewhere, or adding or changing an attribute in the markup) can make a regular expression fail if it was not written carefully. Before using regular expressions on HTML, you should know what you are doing.
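
    The fragility is easy to reproduce. In this sketch (the markup, the pattern, and the helper name hrefsViaDom are all made up for illustration), a typical copy-pasted pattern stops matching as soon as an attribute is added, while a DOM-based lookup returns the same result for both inputs:

```php
<?php
// Two hypothetical snippets that differ only by an added class attribute.
$original = '<a href="/docs">Docs</a>';
$changed  = '<a class="nav" href="/docs">Docs</a>';

// A typical fragile pattern: it assumes href is the first and only attribute.
$pattern = '/<a href="([^"]*)">/';

$regexOriginal = preg_match($pattern, $original);   // matches
$regexChanged  = preg_match($pattern, $changed);    // no longer matches

// Illustrative helper: extract href values with the DOM extension instead.
function hrefsViaDom(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

var_dump($regexOriginal, $regexChanged);
print_r(hrefsViaDom($original));
print_r(hrefsViaDom($changed));
```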

    HTML parsers already know the syntax rules of HTML. Regular expressions have to be taught those rules again for every new pattern you write. Regular expressions are fine in some cases, but it really depends on your use case.

    You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the libraries above already exist and do a much better job.

    See also Parsing HTML The Cthulhu Way.


    Books

    If you want to spend some money, you can take a look

    I am not affiliated with PHP Architect or the authors.
