PHP Regular Expression: How to match all headings in HTML
Extracting titles and headings from HTML with regular expressions is a common task in PHP. Titles and headings summarize the content of a page, which makes it easier for users to understand and browse. In some cases, we need to extract all of them from the HTML for subsequent processing.
This article will introduce how to use PHP regular expressions to quickly and effectively extract all titles in HTML.
1. Classification of HTML titles
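In HTML pages, there are several types of titles, which are defined with the following tags: the h1 through h6 tags for section headings, the title tag for the page title, and meta tags for page metadata such as keywords and description.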
2. PHP Regular Expressions
Regular expressions are a powerful search and replacement tool that can effectively process text strings. In PHP, we can use preg_match(), preg_match_all(), preg_replace() and other functions to implement regular expression matching.
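As a rough illustration of how the three functions differ, the sketch below runs the same pattern through each of them. The sample string and variable names are made up for this example and are not part of the original article.

```php
<?php
// Hypothetical sample input, used only for illustration.
$text = 'Order 12 shipped, order 345 pending.';

// preg_match() stops at the first match and returns 1, 0, or false on error.
$found = preg_match('/\d+/', $text, $first);

// preg_match_all() collects every match and returns how many it found.
$count = preg_match_all('/\d+/', $text, $all);

// preg_replace() returns a new string with every match replaced.
$masked = preg_replace('/\d+/', '#', $text);

var_dump($found);      // int(1)
var_dump($first[0]);   // string "12"
var_dump($count);      // int(2)
print_r($all[0]);      // "12" and "345"
var_dump($masked);     // string "Order # shipped, order # pending."
```

In short, preg_match() suits a single expected value such as the page title, while preg_match_all() suits repeated elements such as headings.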
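The following regular expression syntax is used in this article:
. matches any single character except a newline
* matches the preceding element zero or more times
+ matches the preceding element one or more times
? makes the preceding element optional, or makes a quantifier non-greedy when placed after it, as in .*?
[...] is a character class, for example [1-6] matches one digit from 1 to 6
(...) is a capturing group whose content is returned in the matches array
\s matches a whitespace character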
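The difference between greedy and non-greedy quantifiers matters most for the patterns in this article, so here is a minimal sketch of it. The one-line HTML fragment is a made-up example.

```php
<?php
// Hypothetical one-line fragment with two headings side by side.
$html = '<h1>First</h1><h2>Second</h2>';

// Greedy: .* grabs as much as possible, so a single match spans both headings.
preg_match_all('/<h[1-6]>(.*)<\/h[1-6]>/', $html, $greedy);

// Non-greedy: .*? stops at the first closing tag, giving one match per heading.
preg_match_all('/<h[1-6]>(.*?)<\/h[1-6]>/', $html, $lazy);

print_r($greedy[1]); // one element: "First</h1><h2>Second"
print_r($lazy[1]);   // two elements: "First" and "Second"
```

Without the trailing ? the capture swallows the first closing tag and the second opening tag, which is why the non-greedy form is the usual choice when several tags can share a line.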
3. Match all titles in HTML
Below we will introduce how to use PHP regular expressions to match different types of titles in HTML pages.
First, let’s look at how to match the titles in h1 ~ h6 tags. Suppose we have the following HTML code:
<!DOCTYPE html>
<html>
<head>
    <title>HTML 标题示例</title>
</head>
<body>
    <h1>这是一级标题</h1>
    <h2>这是二级标题</h2>
    <h3>这是三级标题</h3>
    <h4>这是四级标题</h4>
    <h5>这是五级标题</h5>
    <h6>这是六级标题</h6>
</body>
</html>
We can use the preg_match_all() function with the regular expression /<h[1-6]>(.*?)<\/h[1-6]>/ to extract all the headings:
$html = file_get_contents('example.html');
// The "/" in the closing tag is escaped because "/" is also the pattern delimiter
preg_match_all('/<h[1-6]>(.*?)<\/h[1-6]>/', $html, $matches);
print_r($matches[0]);
In the above code, we use the file_get_contents() function to read the HTML file content, and then use the preg_match_all() function with the regular expression /<h[1-6]>(.*?)<\/h[1-6]>/ to match the h1 ~ h6 headings. The pattern matches the string inside an h1 ~ h6 tag, where (.*?) is a non-greedy group that matches as few characters as possible. $matches[0] holds the full matches including the tags, and $matches[1] holds only the text inside them.
The output results are as follows:
Array
(
    [0] => <h1>这是一级标题</h1>
    [1] => <h2>这是二级标题</h2>
    [2] => <h3>这是三级标题</h3>
    [3] => <h4>这是四级标题</h4>
    [4] => <h5>这是五级标题</h5>
    [5] => <h6>这是六级标题</h6>
)
As you can see, we successfully matched all h1 ~ h6 titles in the HTML page.
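The simple pattern only matches lowercase heading tags without attributes whose content sits on one line. Real pages often break those assumptions, so the following is a sketch of a more tolerant variant. The sample markup and the $headings structure are assumptions made for this example, not something taken from the original article.

```php
<?php
// Hypothetical input: headings with attributes, uppercase tags, nested markup and line breaks.
$html = <<<HTML
<h1 class="main">Main <em>heading</em></h1>
<H2 id="intro">
  Introduction
</H2>
<h3>Details</h3>
HTML;

// ([1-6])  captures the heading level
// [^>]*    allows attributes in the opening tag
// \1       requires the closing tag to use the same level
// i        makes the match case-insensitive; s lets . match newlines
$pattern = '/<h([1-6])[^>]*>(.*?)<\/h\1\s*>/is';

$headings = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $headings[] = [
            'level' => (int) $m[1],
            // strip_tags() drops nested markup such as <em>; trim() removes surrounding whitespace
            'text'  => trim(strip_tags($m[2])),
        ];
    }
}

print_r($headings);
```

PREG_SET_ORDER groups the results per match, so each $m carries the level and the inner text of one heading together.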
Next, let’s look at how to match the title of the web page in the title tag. Suppose we have the following HTML code:
<!DOCTYPE html>
<html>
<head>
    <title>HTML 标题示例</title>
</head>
<body>
    <h1>这是一级标题</h1>
    <p>段落内容</p>
    <h2>这是二级标题</h2>
    <p>段落内容</p>
</body>
</html>
We can use the preg_match() function with the regular expression /<title>(.*?)<\/title>/ to extract the web page title:
$html = file_get_contents('example.html');
// The "/" in </title> is escaped because "/" is also the pattern delimiter
preg_match('/<title>(.*?)<\/title>/', $html, $matches);
echo $matches[1];
In the above code, we use the file_get_contents() function to read the HTML file content, and then use the preg_match() function with the regular expression /<title>(.*?)<\/title>/ to match the title tag. The pattern matches the string inside the title tag, where (.*?) is a non-greedy group that matches as few characters as possible. Because the group is the first capturing group, the title text is available in $matches[1].
The output results are as follows:
HTML 标题示例
As you can see, we successfully matched the web page title of the HTML page.
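If the page has no title tag, preg_match() returns 0 and $matches[1] is not set, so it is safer to check the return value before using the result. Below is a minimal sketch of that check. The function name extract_title() is hypothetical, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the decoded page title, or null if none is found.
function extract_title(string $html): ?string
{
    // [^>]* tolerates attributes, i ignores case, s lets the title span several lines
    if (preg_match('/<title[^>]*>(.*?)<\/title\s*>/is', $html, $m) !== 1) {
        return null; // no title element, or a regex error
    }
    // Decode entities such as &amp; and strip surrounding whitespace
    return trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

var_dump(extract_title("<head><TITLE>\n  Tom &amp; Jerry\n</TITLE></head>")); // string "Tom & Jerry"
var_dump(extract_title('<p>No title here</p>'));                              // NULL
```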
Finally, let’s look at how to match the metadata in the meta tag. Suppose we have the following HTML code:
<!DOCTYPE html>
<html>
<head>
    <title>HTML 标题示例</title>
    <meta charset="utf-8">
    <meta name="keywords" content="HTML,标题,元数据">
    <meta name="description" content="HTML 标题示例 - 一个简单的 HTML 页面,包含多种类型的标题和元数据。">
</head>
<body>
    <h1>这是一级标题</h1>
    <p>段落内容</p>
    <h2>这是二级标题</h2>
    <p>段落内容</p>
</body>
</html>
We can use the preg_match_all() function with the regular expression /<meta\s+[^>]*\bname\s*=\s*(['"]?)keywords\1[^>]*>/i to extract the keywords metadata:
$html = file_get_contents('example.html');
// \' escapes the single quote inside the single-quoted PHP string; \1 repeats the captured quote
preg_match_all('/<meta\s+[^>]*\bname\s*=\s*([\'"]?)keywords\1[^>]*>/i', $html, $matches);
print_r($matches[0]);
In the above code, we use the file_get_contents() function to read the HTML file content, and then use the preg_match_all() function with the regular expression /<meta\s+[^>]*\bname\s*=\s*(['"]?)keywords\1[^>]*>/i to match the keywords metadata. The pattern matches a meta tag whose name attribute is keywords: \s+ and \s* allow whitespace, (['"]?) captures the optional quote before keywords, \1 requires the same quote after it, and [^>]* skips any other attributes up to the closing >.
The output result is as follows:
Array
(
    [0] => <meta name="keywords" content="HTML,标题,元数据">
)
As you can see, we successfully matched the keywords metadata.
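Matching the whole tag is often only the first step, since what is usually needed is the list of keywords in the content attribute. The sketch below captures that value and splits it. It assumes that name="keywords" appears before content inside the tag and that the content value is quoted. The sample tag is made up for this example.

```php
<?php
// Hypothetical sample tag for illustration.
$html = '<meta name="keywords" content="HTML,headings, metadata">';

// Assumes name="keywords" appears before content="..." inside the tag.
// Group 1: optional quote around keywords
// Group 2: quote around the content value
// Group 3: the content value itself
$pattern = '/<meta\s+[^>]*\bname\s*=\s*([\'"]?)keywords\1[^>]*\bcontent\s*=\s*([\'"])(.*?)\2[^>]*>/is';

$keywords = [];
if (preg_match($pattern, $html, $m) === 1) {
    // Split on commas and trim whitespace around each keyword
    $keywords = array_map('trim', explode(',', $m[3]));
}

print_r($keywords); // "HTML", "headings", "metadata"
```

A tag that lists content before name would need a second pattern or an attribute-by-attribute approach, which this sketch does not cover.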
4. Summary
This article introduced how to use PHP regular expressions to match different types of titles in HTML pages. By combining functions such as preg_match(), preg_match_all(), and preg_replace() with regular expression syntax, we can easily extract relevant information from HTML code for subsequent processing and analysis.