
PHP parsing html class library simple_html_dom

WBOY
2016-08-08 09:28:50

Download address: https://github.com/samacs/simple_html_dom
The parser does more than help validate HTML documents: it can also parse HTML that does not comply with W3C standards. It uses jQuery-like element selectors to find elements by id, class, tag name, and so on, and it lets you add, delete, and modify nodes in the document tree. Such a powerful HTML DOM parser is not perfect, though: you need to watch its memory consumption. The last section of this article explains how to keep memory use under control.
Getting started
After including the class file in your script, there are three ways to load a document:
Load an HTML document from a URL
Load an HTML document from a string
Load an HTML document from a file

The code is as follows:

<?php
// Include the class file (adjust the path to wherever you saved it)
require_once 'simple_html_dom.php';
// Create a new DOM instance
$html = new simple_html_dom();
// Load from a URL
$html->load_file('http://www.jb51.net');
// Load from a string
$html->load('<html><body>Load html document demo from string</body></html>');
// Load from a file
$html->load_file('path/file/test.html');
?>


If you load an HTML document from a string, you first have to download the markup yourself. cURL is the recommended way to fetch the document before loading it into the DOM.
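A minimal sketch of the cURL approach, assuming the cURL extension is enabled and a copy of simple_html_dom.php compatible with your PHP version sits next to the script. The helper name fetch_html and the example.com URL are placeholders for illustration, not part of the library.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

/**
 * Hypothetical helper: download a page with cURL and return its body,
 * or null if the request fails.
 */
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch); // string on success, false on failure

    return is_string($body) ? $body : null;
}

$markup = fetch_html('http://www.example.com/'); // placeholder URL
if ($markup !== null) {
    $html = new simple_html_dom();
    $html->load($markup);
    // ... query $html here ...
    $html->clear();
}
```

Fetching the page yourself keeps timeouts, redirects, and error handling under your control, which load_file on a URL does not offer directly.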
Finding HTML elements
Use the find function to locate elements in an HTML document. Without an index argument it returns an array of element objects, which you then access through the properties and methods of the parser's element class. Here are a few examples:

The code is as follows:

<?php
// Find all hyperlink elements in the document
$a = $html->find('a');
// Find the (N)th hyperlink (zero-based); returns null if it is not found
$a = $html->find('a', 0);
// Find the div element whose id is "main"
$divs = $html->find('div[id=main]');
// Find all elements that have an id attribute
$divs = $html->find('[id]');
?>
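The difference between the two call styles of find can be sketched as follows. This assumes simple_html_dom.php sits next to the script; the markup string is made up for illustration, and str_get_html is the library's helper for loading a document from a string.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Made-up sample markup
$html = str_get_html('<div id="main"><a href="/a">A</a><a href="/b">B</a></div>');

// Without an index, find() returns an array of element objects
$links = $html->find('a');
foreach ($links as $link) {
    echo $link->href, "\n";
}

// With an index, find() returns a single element object ...
$first = $html->find('a', 0);

// ... or null when nothing matches, so check before dereferencing
$missing = $html->find('img', 0);
if ($missing === null) {
    echo "no image found\n";
}

$html->clear();
```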

You can also use jQuery-like selectors to locate elements:

The code is as follows:

<?php
// Find the element with id="container"
$ret = $html->find('#container');
// Find all elements with class="foo"
$ret = $html->find('.foo');
// Find several HTML tags at once
$ret = $html->find('a, img');
// Attribute selectors can be combined in the same way
$ret = $html->find('a[title], img[title]');
?>

The parser also supports descendant selectors:

The code is as follows:

<?php
// Find all li items inside ul lists
$ret = $html->find('ul li');
// Find the li items with class="selected" inside ul lists
$ret = $html->find('ul li.selected');
?>

If selectors feel cumbersome, you can use the built-in functions to reach an element's parent, children, and siblings directly:

The code is as follows:

<?php
// Returns the parent element
$e->parent();
// Returns an array of child elements
$e->children();
// Returns the child element at the given index
$e->children(0);
// Returns the first child element
$e->first_child();
// Returns the last child element
$e->last_child();
// Returns the previous sibling element
$e->prev_sibling();
// Returns the next sibling element
$e->next_sibling();
?>
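A short sketch of these traversal functions on a small list, assuming simple_html_dom.php sits next to the script. The markup and variable names are made up for illustration.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Made-up sample markup
$html = str_get_html('<ul id="menu"><li>One</li><li>Two</li><li>Three</li></ul>');

$second = $html->find('li', 1);   // the middle list item
$ul     = $second->parent();      // its parent <ul>

echo $ul->id, "\n";                              // id attribute of the parent
echo $second->prev_sibling()->plaintext, "\n";   // item before it
echo $second->next_sibling()->plaintext, "\n";   // item after it
echo $ul->first_child()->plaintext, "\n";        // first child of the list
echo $ul->last_child()->plaintext, "\n";         // last child of the list
echo count($ul->children()), "\n";               // number of child elements
echo $ul->children(1)->plaintext, "\n";          // child at index 1

$html->clear();
```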

Working with element attributes

Attribute selectors support a few simple, regex-like matching operators:
[attribute] - selects elements that have the given attribute
[attribute=value] - selects elements whose attribute equals the given value
[attribute!=value] - selects elements whose attribute does not equal the given value
[attribute^=value] - selects elements whose attribute starts with the given value
[attribute$=value] - selects elements whose attribute ends with the given value
[attribute*=value] - selects elements whose attribute contains the given value
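The operators above can be tried out on a small fragment. This is a sketch that assumes simple_html_dom.php sits next to the script; the links are made up for illustration.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Made-up sample markup: two links with href, one named anchor without
$html = str_get_html(
    '<a href="http://example.com/a.pdf" title="Report">A</a>'
    . '<a href="/local/b.html">B</a>'
    . '<a name="anchor">C</a>'
);

$withHref    = $html->find('a[href]');         // has an href attribute
$titled      = $html->find('a[title=Report]'); // title equals "Report"
$absolute    = $html->find('a[href^=http]');   // href starts with "http"
$pdfLinks    = $html->find('a[href$=.pdf]');   // href ends with ".pdf"
$localLinks  = $html->find('a[href*=local]');  // href contains "local"

printf(
    "href: %d, title=Report: %d, ^=http: %d, \$=.pdf: %d, *=local: %d\n",
    count($withHref), count($titled), count($absolute),
    count($pdfLinks), count($localLinks)
);

$html->clear();
```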
Accessing element attributes in the parser
Each attribute of an element is exposed as a property of the element object:

The code is as follows:

<?php
// Assign the href value of the anchor $a to the $link variable
$link = $a->href;
?>

or:

The code is as follows:

<?php
$link = $html->find('a', 0)->href;
?>


Each element object has four basic properties:
tag – returns the HTML tag name
innertext – returns the innerHTML
outertext – returns the outerHTML
plaintext – returns the text content with the HTML tags stripped
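To make the four properties concrete, here is a small sketch, assuming simple_html_dom.php sits next to the script. The markup is made up for illustration.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Made-up sample markup
$html = str_get_html('<div id="a"><b>Hi</b> there</div>');
$e = $html->find('#a', 0);

echo $e->tag, "\n";        // the tag name of the element
echo $e->innertext, "\n";  // markup between the opening and closing tag
echo $e->outertext, "\n";  // the element including its own tags
echo $e->plaintext, "\n";  // text only, tags stripped

$html->clear();
```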
Editing elements in the parser
Element attributes are edited in much the same way as they are read:

The code is as follows:

<?php
// Assign a new value to the href of the anchor $a
$a->href = 'http://www.jb51.net';
// Delete the href attribute
$a->href = null;
// Check whether the href attribute exists
if (isset($a->href)) {
    // code
}
?>


The parser has no dedicated methods for adding or deleting elements, but you can get the same result by rewriting an element's outertext:

The code is as follows:

<?php
// Wrap the element (a <div> wrapper is used here as an example)
$e->outertext = '<div>' . $e->outertext . '</div>';
// Delete the element
$e->outertext = '';
// Append an element after it
$e->outertext = $e->outertext . '<div>foo</div>';
// Insert an element before it
$e->outertext = '<div>foo</div>' . $e->outertext;
?>
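The outertext technique can be sketched end to end like this, assuming simple_html_dom.php sits next to the script. The markup and the "ad" and "wrap" class names are made up for illustration.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Made-up sample markup
$html = str_get_html('<p>Keep</p><p class="ad">Drop</p>');

// Delete every paragraph with class="ad" by emptying its outertext
foreach ($html->find('p.ad') as $ad) {
    $ad->outertext = '';
}

// Wrap the first paragraph in a div
$p = $html->find('p', 0);
$p->outertext = '<div class="wrap">' . $p->outertext . '</div>';

// The edits show up when the document is converted back to a string
$result = (string) $html;
echo $result, "\n";

$html->clear();
```

Note that these edits only replace the text the parser will output for a node; the node tree that find searches is not rebuilt, so reload the resulting string if you need to query the new structure.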


Saving the modified HTML DOM document is just as simple:

The code is as follows:

<?php
$doc = $html;
// Output: echo converts the DOM object to a string containing the modified markup
echo $doc;
?>
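Besides echoing the object, the library's save method returns the markup as a string and, when given a path, also writes it to a file. A minimal sketch, assuming simple_html_dom.php sits next to the script; the markup and the temporary file name are made up for illustration.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Made-up sample markup
$html = str_get_html('<a href="/old">Link</a>');
$html->find('a', 0)->href = '/new';

// save() without arguments returns the modified markup as a string
$markup = $html->save();

// save() with a path also writes the markup to that file
$target = sys_get_temp_dir() . '/simple_html_dom_demo.html'; // hypothetical file name
$html->save($target);

echo $markup, "\n";

$html->clear();
```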


How to keep the parser from consuming too much memory
The beginning of this article mentioned that the Simple HTML DOM parser can consume a lot of memory. If a PHP script uses too much memory, the website may stop responding and a series of other serious problems can follow. The solution is simple: once you have loaded an HTML document and finished with it, remember to clean up the parser object. There is no need to overdo it, though. If you only load two or three documents, cleaning up makes little difference. When you load 5, 10, or more documents, you should free the memory each time you finish with one.

The code is as follows:

<?php
// Release the memory held by the parsed document
$html->clear();
?>
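When many documents are processed in a loop, the cleanup might look like the sketch below. It assumes simple_html_dom.php sits next to the script; the $documents array stands in for markup you would normally download or read from disk.

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once __DIR__ . '/simple_html_dom.php';

// Made-up stand-ins for documents fetched elsewhere
$documents = ['<p>first</p>', '<p>second</p>', '<p>third</p>'];
$texts = [];

foreach ($documents as $markup) {
    $html = str_get_html($markup);

    // Copy out plain strings before cleaning up; element objects
    // become unusable once the document has been cleared
    $texts[] = $html->find('p', 0)->plaintext;

    // Free the parser's internal references, then drop the object
    $html->clear();
    unset($html);
}

print_r($texts);
```

Keeping only plain strings or arrays from each document, rather than element objects, is what lets the parser's memory be released between iterations.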

