PHP code to parse HTML using DiDOM
Every now and then developers need to crawl web pages to get some information from the website. For example, let's say you are working on a personal project where you have to get geographic information about the capitals of different countries from Wikipedia. Manual entry takes a lot of time. However, you can do this very quickly with the help of PHP by scraping Wikipedia pages. You can also automatically parse HTML for specific information without having to manually browse through the entire markup.
In this tutorial, we'll take a look at a fast and easy-to-use HTML parser called DiDOM. We will start with the installation process and then learn how to use different types of selectors (such as tags, classes, etc.) to extract information from different elements on the web page.
You can easily install DiDOM in your project directory by running the following command:
composer require imangazaliev/didom
After running the above command, you will be able to load HTML from a string, local file, or web page. Here is an example:
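require_once('vendor/autoload.php');
use DiDom\Document;
// Load from a string (assumes $washington_dc_html_string already holds the markup).
$document = new Document($washington_dc_html_string);
// Load from a local file.
$document = new Document('washington_dc.html', true);
// Load from a web page.
$url = 'https://en.wikipedia.org/wiki/Washington,_D.C.';
$document = new Document($url, true);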
If the HTML you want to parse has already been loaded and stored in a variable, you can simply pass that variable to Document() and DiDOM will prepare the string for parsing.
If the HTML must be loaded from a file or URL, pass the path or URL as the first argument to Document() and set the second argument to true.
You can also create a new Document object using new Document() without any parameters. In this case, you can call the loadHtml() method to load HTML from a string, and loadHtmlFile() to load HTML from a file or web page.
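As a minimal sketch of the parameterless constructor, the snippet below loads markup first from a string and then from a file. The sample markup and the temporary file are made up for illustration, and the snippet assumes DiDOM was installed with Composer in the current directory.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical sample markup, used only for illustration.
$html = '<html><body><h1>Hello from a string</h1></body></html>';

// Create an empty document, then load the string into it.
$document = new Document();
$document->loadHtml($html);
$from_string = $document->find('h1')[0]->text();
echo $from_string . "\n";

// Write a small temporary file so that loadHtmlFile() has something to read.
$path = tempnam(sys_get_temp_dir(), 'didom');
file_put_contents($path, '<html><body><h1>Hello from a file</h1></body></html>');

$file_document = new Document();
$file_document->loadHtmlFile($path);
$from_file = $file_document->find('h1')[0]->text();
echo $from_file . "\n";

// Clean up the temporary file.
unlink($path);
```

Passing a URL to loadHtmlFile() instead of a local path should work the same way, provided PHP is allowed to open remote files.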
The first thing to do before getting HTML or text from an element is to find the element itself. The simplest way is to use the find() method and pass the CSS selector of the desired element as the first argument.
You can also pass the element's XPath as the first argument to the find() method. However, this requires you to pass Query::TYPE_XPATH as the second parameter.
If you only want to use XPath expressions to find HTML elements, you can simply use the xpath() method instead of passing Query::TYPE_XPATH as the second parameter of find() each time.
If DiDOM finds elements that match the given CSS selector or XPath expression, it returns an array of DiDom\Element instances. If no such element is found, it returns an empty array.
Since these methods return an array, you can use find()[n-1] to directly access the nth matching element.
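The three lookup styles described above can be compared side by side. This is only a sketch: the list markup is invented for the example, and it assumes the Composer autoloader is available.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;
use DiDom\Query;

// Hypothetical sample markup, used only for illustration.
$document = new Document('<ul><li class="item">One</li><li class="item">Two</li></ul>');

// 1. CSS selector (the default query type).
$by_css = $document->find('li.item');

// 2. XPath expression passed to find() with an explicit query type.
$by_xpath = $document->find('//li[@class="item"]', Query::TYPE_XPATH);

// 3. The xpath() shortcut, which needs no second argument.
$by_shortcut = $document->xpath('//li[@class="item"]');

echo count($by_css) . "\n";        // number of matches for the CSS selector
echo $by_xpath[1]->text() . "\n";  // text of the second match

// A selector that matches nothing yields an empty array.
$missing = $document->find('p.does-not-exist');
var_dump($missing === []);
```

Because a missing element produces an empty array rather than an exception, checking count() or empty() before indexing into the result avoids errors on pages whose markup has changed.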
In the example below, we get the HTML of the first-level and second-level headings in the Wikipedia article about Washington, D.C.:
require_once('vendor/autoload.php');
use DiDom\Document;
$document = new Document('https://en.wikipedia.org/wiki/Washington,_D.C.', true);
// Output the main heading of the article.
$main_heading = $document->find('h1.firstHeading')[0];
echo $main_heading->html();
// Output each second-level heading until "See also" is reached.
$sub_headings = $document->find('h2');
foreach ($sub_headings as $sub_heading) {
    if ($sub_heading->text() !== 'See also') {
        echo $sub_heading->html();
    } else {
        break;
    }
}
We first create a new Document object by passing the URL of the Wikipedia article about Washington, D.C. After that, we use the find() method to get the main heading element and store it inside a variable named $main_heading. We can now call different methods on this element, such as text(), innerHtml(), html(), etc.
For the main heading, we simply call the html() method to return the HTML of the entire heading element. Likewise, we can get the HTML inside a specific element using the innerHtml() method. Sometimes you are more interested in an element's plain text content than in its HTML. In this case, you can just use the text() method.
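To make the difference between the three methods concrete, here is a small sketch run against an invented heading. The markup is an assumption for the example, not taken from the Wikipedia page.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical sample markup, used only for illustration.
$document = new Document('<h1 class="title">Hello <em>world</em></h1>');
$heading = $document->find('h1.title')[0];

// html(): markup of the element itself, including its own <h1> tag.
echo $heading->html() . "\n";

// innerHtml(): only the markup inside the element.
echo $heading->innerHtml() . "\n";

// text(): plain text content with all tags stripped.
echo $heading->text() . "\n";
```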
Second-level headings divide our Wikipedia page into well-defined sections. However, you may want to leave out some of these subheadings, such as "See also", "Notes", etc.
One way is to loop through all second-level headings and check the value returned by the text() method. If the returned heading text is "See also", we break out of the loop.
You can also use $document->find('h2')[3] and $document->find('h2')[5] to directly reach the fourth and sixth second-level headings, respectively.
Once you have access to a specific element, the library lets you traverse up and down the DOM tree to easily access other elements.
You can use the parent() method to go to the parent element of an HTML element. Likewise, you can get the next or previous sibling of an element using the nextSibling() and previousSibling() methods.
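There are also many methods available for accessing the children of a DOM element. For example, you can use the child(n) method to get a specific child element. Likewise, you can use the firstChild() and lastChild() methods to access the first or last child of a particular element. You can use the children() method to loop over all the children of a particular DOM element.
Once you reach a specific element, you will be able to use the html(), innerHtml(), and text() methods on it.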
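The traversal methods can be tried out on a small list without fetching a web page. The markup below is hypothetical, and it is written without whitespace between tags so that every child node is an li element.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical sample markup, used only for illustration.
$document = new Document('<ul id="menu"><li>One</li><li>Two</li><li>Three</li></ul>');
$list = $document->find('#menu')[0];

// First and last children of the list.
$first = $list->firstChild()->text();
$last = $list->lastChild()->text();

// child() takes a zero-based index, so 1 is the second item.
$second_item = $list->child(1);

// Move sideways and upwards from the second item.
$next = $second_item->nextSibling()->text();
$previous = $second_item->previousSibling()->text();
$parent_id = $second_item->parent()->attr('id');

// Collect the text of every child.
$labels = [];
foreach ($list->children() as $child) {
    $labels[] = $child->text();
}

echo implode(', ', $labels) . "\n";
echo "$first / $last / $previous / $next / $parent_id\n";
```

On real pages, whitespace between tags usually shows up as text nodes among the children, so the indexes passed to child() may not line up with the visible elements.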
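In the example below, we start from the second-level heading elements and keep checking whether the next sibling element contains some text. As soon as we find a sibling element with some text, we output it to the browser.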
require_once('vendor/autoload.php');
use DiDom\Document;
$document = new Document('https://en.wikipedia.org/wiki/Washington,_D.C.', true);
$sub_headings = $document->find('h2');
for ($i = 1; $i < count($sub_headings); $i++) {
    if ($sub_headings[$i]->text() !== 'See also') {
        $next_sibling = $sub_headings[$i]->nextSibling();
        // Skip siblings with no content, stopping if we run out of siblings.
        while ($next_sibling !== null && !$next_sibling->html()) {
            $next_sibling = $next_sibling->nextSibling();
        }
        if ($next_sibling !== null) {
            echo $next_sibling->html()."<br>";
        }
    } else {
        break;
    }
}
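You can use a similar technique to loop through all the sibling elements and output the text only if it contains a particular string, if the sibling element is a paragraph tag, and so on. Once you know the basics, finding the right information is easy.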
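In some cases, the ability to get or set the attribute values of different elements is very useful. For example, we can get the value of the src attribute of an image element using $image_elem->attr('src'). In a similar manner, you can get the value of the href attribute for all the a tags in a document.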
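There are three ways to get the value of a given attribute of an HTML element. You can use the getAttribute('attrName') method and pass the name of the attribute you are interested in as a parameter. You can also use the attr('attrName') method, which works just like getAttribute(). Finally, the library also lets you get the attribute value directly using $elem->attrName. This means that you can get the value of the src attribute of an image element directly using $imageElem->src.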
require_once('vendor/autoload.php');
use DiDom\Document;
$document = new Document('https://en.wikipedia.org/wiki/Washington,_D.C.', true);
// Print the src attribute of every image on the page.
$images = $document->find('img');
foreach ($images as $image) {
    echo $image->src."<br>";
}
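Once you have access to the src attributes, you can write code to automatically download all the image files. This way, you will be able to save a lot of time.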
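You can also set the value of a given attribute using three different techniques. First, you can use the setAttribute('attrName', 'attrValue') method to set the attribute value. You can also use the attr('attrName', 'attrValue') method. Finally, you can set the attribute value of a given element using $elem->attrName = 'attrValue'.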
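The getters and setters can be exercised together on a single link. The following is a rough sketch with made-up markup and attribute values.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical sample markup, used only for illustration.
$document = new Document('<a href="/old" title="Link">Go</a>');
$link = $document->find('a')[0];

// Three equivalent ways to read an attribute.
$href_a = $link->getAttribute('href');
$href_b = $link->attr('href');
$href_c = $link->href;
echo "$href_a $href_b $href_c\n";

// Three equivalent ways to write an attribute.
$link->setAttribute('href', '/new');
$link->attr('title', 'Updated');
$link->rel = 'nofollow';

// The changes are visible in the element's markup.
echo $link->html() . "\n";
```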
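You can also make changes to the loaded HTML document using different methods provided by the library. For example, you can add, replace, or remove elements from the DOM tree using the appendChild(), replace(), and remove() methods.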
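The library also allows you to create your own HTML elements in order to append them to the original HTML document. You can create a new Element object using new Element('tagName', 'tagContent').
Keep in mind that you will get an "Uncaught Error: Class 'Element' not found" error if your program does not contain the line use DiDom\Element before the Element object is instantiated.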
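Once you have the element, you can either append it to other elements in the DOM using the appendChild() method, or use the replace() method to swap an old HTML element in the document for the newly instantiated one. The following example should help clarify this concept.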
require_once('vendor/autoload.php');
use DiDom\Document;
use DiDom\Element;
$document = new Document('https://en.wikipedia.org/wiki/Washington,_D.C.', true);
// Uncommenting the next line results in an error, because no such element exists yet.
// echo $document->find('h2.test-heading')[0]->html()."\n";
$test_heading = new Element('h2', 'This is test heading.');
$test_heading->class = 'test-heading';
// Replace the first h1 in the document with the new h2.
$document->find('h1')[0]->replace($test_heading);
echo $document->find('h2.test-heading')[0]->html()."\n";
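Initially, there is no h2 element with the class test-heading in our document. Therefore, trying to access such an element results in an error.
After verifying that no such element exists, we create a new h2 element and set the value of its class attribute to test-heading.
After that, we replace the first h1 element in the document with the newly created h2 element. Using the find() method on our document again to look for an h2 heading with the class test-heading now returns one element.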
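The same three operations, including remove(), can also be tried offline on a tiny document. This is a sketch only: the markup, class names, and text are invented for the example.

```php
<?php
require_once 'vendor/autoload.php';

use DiDom\Document;
use DiDom\Element;

// Hypothetical sample markup, used only for illustration.
$document = new Document('<div id="box"><h1>Old title</h1><p class="ad">Ad</p></div>');
$box = $document->find('#box')[0];

// Replace the h1 with a newly created h2.
$document->find('h1')[0]->replace(new Element('h2', 'New title'));

// Remove the unwanted paragraph from the tree.
$document->find('p.ad')[0]->remove();

// Append a new paragraph as the last child of the container.
$box->appendChild(new Element('p', 'Appended paragraph'));

// Print the container to inspect the result of all three changes.
echo $box->html() . "\n";
```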
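This tutorial covered the basics of the PHP DiDOM HTML parser. We began with the installation and then learned how to load HTML from a string, file, or URL. After that, we discussed how to find a particular element based on its CSS selector or XPath. We also learned how to get the siblings, parent, or children of an element. The remaining sections covered how to manipulate the attributes of a particular element or add, remove, and replace elements in an HTML document.
If there is anything you would like me to clarify in the tutorial, feel free to let me know in the comments.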