How to parse HTML/XML and extract information from it?
P粉520545753 2023-10-13 00:52:20
Note: As the name suggests, it is useful for simple tasks. It uses regular expressions instead of an HTML parser, so it will be much slower for more complex tasks. Most of its codebase was written in 2008, with only minor improvements made since then. It does not follow modern PHP coding standards and is difficult to incorporate into modern PSR-compliant projects.
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
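// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// Set the class of the second div and replace the content of the div with id "hello"
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html;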
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
$articles = [];
foreach ($html->find('div.article') as $article) {
    $item = [];
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
P粉619896145 2023-10-13 00:47:49
I prefer to use one of the native XML extensions because they ship with PHP, are generally faster than any third-party library, and give me all the control I need over the markup.
DOM can parse and modify real-world (broken) HTML, and it can run XPath queries. It is based on libxml.
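To make the point about broken HTML and XPath concrete, here is a minimal sketch, not a definitive implementation. The markup in `$html` and the variable names are made up for illustration; the calls themselves (`DOMDocument::loadHTML`, `DOMXPath::query`, `libxml_use_internal_errors`) are part of PHP's DOM and libxml extensions.

```php
<?php
// Illustrative input only: the <p> and <b> tags are deliberately left unclosed.
$html = '<html><body><p>First <b>bold<p>Second '
      . '<a href="/one">one</a> <a href="/two">two</a></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html); // uses libxml's HTML parser, which tolerates broken markup
libxml_clear_errors();

// Query the repaired tree with XPath: every <a> that has an href attribute.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a[@href]') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

print_r($hrefs);
```

The parser repairs the unclosed tags on its own, so the query only has to describe what you want, not how the markup happens to be written.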
Working with DOM takes some time to become productive, but in my opinion, it's worth the time. Since DOM is a language-neutral interface, you'll find implementations in multiple languages, so if you need to change programming languages, you most likely already know how to use that language's DOM API.
Using the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure that most of the problems you run into can be solved by searching or browsing Stack Overflow.
Basic usage examples and a general conceptual overview can be found in other answers.
XMLReader, like DOM, is based on libxml. I don't know how to trigger its HTML parser module, so using XMLReader to parse broken HTML may be less robust than using DOM, where you can explicitly tell it to use libxml's HTML parser module.
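For well-formed XML, a pull-parsing loop with XMLReader might look like the sketch below. The sample document and the `title` element name are assumptions made up for this example; `XMLReader::XML`, `read`, and `readString` are existing methods of the extension.

```php
<?php
// Illustrative, well-formed XML (XMLReader uses the XML parser, not the HTML one).
$xml = '<books>'
     . '<book id="1"><title>Alpha</title></book>'
     . '<book id="2"><title>Beta</title></book>'
     . '</books>';

$reader = new XMLReader();
$reader->XML($xml); // load the document from a string

// Move through the document node by node and collect the text of each <title>.
$titles = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        $titles[] = $reader->readString();
    }
}
$reader->close();

print_r($titles);
```

Because the reader only keeps the current node in memory, this style suits large documents better than loading the whole tree.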
A basic usage example is provided in another answer.
Basic usage examples are provided, and there are many other examples in the PHP manual.
3rd party libraries (based on libxml): if you prefer a third-party library, use one that builds on DOM/libxml underneath instead of string parsing.
FluentDOM
The benefit of building on top of DOM/libxml is that you get good performance out of the box because you're building on native extensions. However, not all third-party libraries go this route. Some of them are listed below.
I generally do not recommend this parser. The codebase is terrible, and the parser itself is quite slow and memory intensive. Not all jQuery selectors (such as child selectors) are possible. Any libxml-based library should easily outperform it.
Again, I would not recommend this parser. It is quite slow and CPU-intensive. There is also no function to free the memory of created DOM objects. These problems are especially severe in nested loops. The documentation itself is inaccurate and contains misspellings, and there has been no response to proposed fixes since April 14, 2016.
You can use the above to parse HTML5, but there can be quirks because of the markup HTML5 allows. For HTML5 you may therefore want to consider a dedicated parser. Note that these are written in PHP, so they are slower and use more memory than extensions compiled from a lower-level language.
Last and least recommended, you can use regular expressions to extract data from HTML. In general, using regular expressions on HTML is discouraged.
Most of the code snippets you find on the web for matching markup are fragile. In most cases, they only work with a very specific piece of HTML. Small markup changes, such as adding whitespace somewhere or adding or changing an attribute in a tag, can make a regular expression fail if it is not written carefully. Before using regular expressions on HTML, you should know what you are doing.
HTML parsers already know the syntax rules of HTML. Regular expressions have to be taught those rules again for every new expression you write. Regular expressions are fine in some cases, but it really depends on your use case.
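As a small illustration of that fragility, the sketch below runs a typical pattern against a link and then against the same link with slightly different markup, and compares the result with DOM. The pattern, the sample strings, and the helper `hrefsViaDom` are hypothetical and written only for this example.

```php
<?php
// A typical fragile pattern: it assumes href is the first attribute and is double-quoted.
$pattern = '/<a href="([^"]*)">/';

$original = '<a href="/home">Home</a>';
$changed  = '<a class="nav" href=\'/home\'>Home</a>'; // same link, small markup change

$regexOriginal = preg_match($pattern, $original); // 1: the pattern matches
$regexChanged  = preg_match($pattern, $changed);  // 0: the pattern no longer matches

// Hypothetical helper: collect href values with the DOM extension instead.
function hrefsViaDom(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

var_dump($regexOriginal, $regexChanged);
print_r(hrefsViaDom($original));
print_r(hrefsViaDom($changed));
```

The parser returns the same href for both inputs because attribute order and quoting style are part of the syntax rules it already knows.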
You can write a more reliable parser, but writing a complete and reliable custom parser with regular expressions is a waste of time when the libraries above already exist and do a much better job.
See also Parsing HTML the Cthulhu Way.
If you want to spend some money, you can take a look
I am not affiliated with PHP Architect or the authors.