PHP Regular Expression: How to match all headings in HTML

WBOY
Original
2023-06-22 22:14:52

Using regular expressions to match HTML titles (headings) is a common operation in PHP. Titles summarize the content of a page and make it easier for users to understand and browse it. In some cases, we need to extract all titles from an HTML document for subsequent processing.

This article will introduce how to use PHP regular expressions to quickly and effectively extract all titles in HTML.

1. Classification of HTML titles

In HTML pages, there are many types of titles, which can be defined using the following tags:

  1. h1 ~ h6 tags: used to indicate the level of the title, h1 is the highest and h6 is the lowest;
  2. title tag: used to define the title of the web page, located in the head tag;
  3. meta tag: used to define the metadata of the web page, often used in search engine optimization.

2. PHP Regular Expressions

Regular expressions are a powerful search and replacement tool that can effectively process text strings. In PHP, we can use preg_match(), preg_match_all(), preg_replace() and other functions to implement regular expression matching.
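As a rough illustration of how the three functions differ, here is a minimal sketch. The sample text and the date pattern are assumptions made up for this example, not part of the original article.

```php
<?php
// Hypothetical sample text, used only for illustration.
$text = 'Order 66 shipped on 2023-06-22, order 67 on 2023-06-23.';

// preg_match() stops at the first match and returns 1 (found) or 0 (not found).
$found = preg_match('/\d{4}-\d{2}-\d{2}/', $text, $first);

// preg_match_all() collects every match and returns how many it found.
$count = preg_match_all('/\d{4}-\d{2}-\d{2}/', $text, $all);

// preg_replace() returns a new string with every match replaced.
$masked = preg_replace('/\d{4}-\d{2}-\d{2}/', '[date]', $text);

echo $first[0], "\n";               // the first date only
echo implode(', ', $all[0]), "\n";  // every date
echo $masked, "\n";                 // text with the dates masked
```

In short: reach for preg_match() when one result is enough, preg_match_all() when every occurrence matters, and preg_replace() when the text itself should change.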

The following are some commonly used regular expression syntax:

  1. \d: matches a digit;
  2. \w: matches a letter, digit, or underscore;
  3. \s: matches a whitespace character;
  4. ^: matches the beginning of the string;
  5. $: matches the end of the string;
  6. *: matches the previous character zero or more times;
  7. +: matches the previous character one or more times;
  8. ?: matches the previous character zero or one time;
  9. []: matches any one character from a set;
  10. (): groups an expression for subsequent operations.
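The syntax elements listed above can be combined in a single pattern. The following sketch parses a hypothetical "key = value" line; the input format and the variable names are assumptions chosen for illustration.

```php
<?php
// Hypothetical input line; the key/value format is an assumption.
$line = 'max_items =  42';

// ^ and $   anchor the pattern to the whole string
// (\w+)     group 1: one or more letters, digits, or underscores
// \s*       zero or more whitespace characters
// (\d+)     group 2: one or more digits
$ok = preg_match('/^(\w+)\s*=\s*(\d+)$/', $line, $m);

// [+-]? makes a single leading sign optional; [] lists the allowed characters.
$signed = preg_match('/^[+-]?\d+$/', '-7');

if ($ok === 1) {
    echo $m[1], ' => ', $m[2], "\n";
}
```

Note that the patterns are written in single-quoted PHP strings, so backslash sequences such as \d and \s reach the regex engine unchanged.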

3. Match all titles in HTML

Below we will introduce how to use PHP regular expressions to match different types of titles in HTML pages.

  1. h1 ~ h6 tags

First, let’s look at how to match the titles in h1 ~ h6 tags. Suppose we have the following HTML code:

<!DOCTYPE html>
<html>
<head>
    <title>HTML 标题示例</title>
</head>
<body>
    <h1>这是一级标题</h1>
    <h2>这是二级标题</h2>
    <h3>这是三级标题</h3>
    <h4>这是四级标题</h4>
    <h5>这是五级标题</h5>
    <h6>这是六级标题</h6>
</body>
</html>

We can use the preg_match_all() function with the regular expression /<h[1-6]>(.*?)<\/h[1-6]>/ to extract all the titles:

$html = file_get_contents('example.html');
// The "/" of the closing tag is escaped as "\/" because "/" is also the pattern delimiter.
preg_match_all('/<h[1-6]>(.*?)<\/h[1-6]>/', $html, $matches);
// $matches[0] holds the full matches; $matches[1] holds only the captured text.
print_r($matches[0]);

In the above code, we use the file_get_contents() function to read the HTML file content, and then use the preg_match_all() function with the regular expression /<h[1-6]>(.*?)<\/h[1-6]>/ to match the h1 ~ h6 titles.

In this regular expression, <h[1-6]> and <\/h[1-6]> match the opening and closing tags, and (.*?) captures the string between them in non-greedy mode, matching as few characters as possible. Note that this pattern assumes tags without attributes and titles that fit on a single line.

The output results are as follows:

Array
(
    [0] => <h1>这是一级标题</h1>
    [1] => <h2>这是二级标题</h2>
    [2] => <h3>这是三级标题</h3>
    [3] => <h4>这是四级标题</h4>
    [4] => <h5>这是五级标题</h5>
    [5] => <h6>这是六级标题</h6>
)

As you can see, we successfully matched all h1 ~ h6 titles in the HTML page.
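Real pages often have headings with attributes, uppercase tag names, or content that spans several lines. The sketch below is one possible way to handle those cases; the helper name extract_headings() and the sample HTML are hypothetical, and the pattern is still only a heuristic, not a full HTML parser.

```php
<?php
/**
 * Hypothetical helper: returns each heading as ['level' => int, 'text' => string].
 */
function extract_headings(string $html): array
{
    // <h([1-6])  captures the level; \b[^>]* allows attributes such as class="..."
    // (.*?)      non-greedy heading content
    // <\/h\1>    the closing tag must have the same level as the opening tag
    // i = case-insensitive, s = "." also matches newlines
    $pattern = '/<h([1-6])\b[^>]*>(.*?)<\/h\1>/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $headings = [];
    foreach ($matches as $m) {
        $headings[] = [
            'level' => (int) $m[1],
            // Drop nested inline tags and surrounding whitespace.
            'text'  => trim(strip_tags($m[2])),
        ];
    }
    return $headings;
}

$html = "<h1 class=\"top\">Main <em>title</em></h1>\n<H2>\n  Section\n</H2>";
print_r(extract_headings($html));
```

PREG_SET_ORDER groups the results per match, so each loop iteration sees the level and the content of one heading together.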

  2. title tag

Next, let’s look at how to match the title of the web page in the title tag. Suppose we have the following HTML code:

<!DOCTYPE html>
<html>
<head>
    <title>HTML 标题示例</title>
</head>
<body>
    <h1>这是一级标题</h1>
    <p>段落内容</p>
    <h2>这是二级标题</h2>
    <p>段落内容</p>
</body>
</html>

We can use the preg_match() function with the regular expression /<title>(.*?)<\/title>/ to extract the web page title:

$html = file_get_contents('example.html');
// preg_match() returns 1 on a match; check it before reading the capture group.
if (preg_match('/<title>(.*?)<\/title>/', $html, $matches) === 1) {
    echo $matches[1]; // the text between <title> and </title>
}

In the above code, we use the file_get_contents() function to read the HTML file content, and then use the preg_match() function with the regular expression /<title>(.*?)<\/title>/ to match the title tag.

In this regular expression, <title> and <\/title> match the opening and closing tags, and (.*?) captures the string inside the title tag in non-greedy mode, matching as few characters as possible. The captured text is available in $matches[1].

The output results are as follows:

HTML 标题示例

As you can see, we successfully matched the web page title of the HTML page.
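Titles in real pages may span several lines or contain HTML entities. A minimal sketch that accounts for this, assuming a hypothetical helper named extract_title(), could look like the following.

```php
<?php
/**
 * Hypothetical helper: returns the page title, or null when there is none.
 */
function extract_title(string $html): ?string
{
    // i = case-insensitive tag name, s = the title may span several lines.
    if (preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $html, $m) !== 1) {
        return null; // no match (or a regex error): do not read $m[1]
    }
    // Decode entities such as &amp; and collapse runs of whitespace.
    $title = html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $title));
}

var_dump(extract_title("<TITLE>\n  Tom &amp; Jerry\n</TITLE>"));
var_dump(extract_title('<p>no title here</p>'));
```

Returning null instead of an empty string lets the caller distinguish a missing title tag from an empty one.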

  3. meta tag

Finally, let’s look at how to match the metadata in the meta tag. Suppose we have the following HTML code:

<!DOCTYPE html>
<html>
<head>
    <title>HTML 标题示例</title>
    <meta charset="utf-8">
    <meta name="keywords" content="HTML,标题,元数据">
    <meta name="description" content="HTML 标题示例 - 一个简单的 HTML 页面,包含多种类型的标题和元数据。">
</head>
<body>
    <h1>这是一级标题</h1>
    <p>段落内容</p>
    <h2>这是二级标题</h2>
    <p>段落内容</p>
</body>
</html>

We can use the preg_match_all() function with the regular expression /<meta\s+[^>]*\bname\s*=\s*(["']?)keywords\1[^>]*>/i to extract the keyword metadata:

$html = file_get_contents('example.html');
// Inside a single-quoted PHP string, the quote in ["'] must be written as \'.
$pattern = '/<meta\s+[^>]*\bname\s*=\s*(["\']?)keywords\1[^>]*>/i';
preg_match_all($pattern, $html, $matches);
print_r($matches[0]);

In the above code, we use the file_get_contents() function to read the HTML file content, and then use the preg_match_all() function with this regular expression to match the keyword metadata.

In this regular expression, [^>]* skips any other attributes inside the tag, name\s*=\s* allows optional whitespace around the equals sign, (["']?) captures an optional opening quote, and \1 requires the same quote after keywords. Together they match the whole meta tag whose name attribute is keywords.

The output result is as follows:

Array
(
    [0] => <meta name="keywords" content="HTML,标题,元数据">
)

As you can see, we successfully matched the keyword metadata.
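Matching the whole tag is often only the first step; usually the value of the content attribute is what we want. The sketch below shows one possible approach with a hypothetical helper named extract_keywords(). It assumes the content attribute is quoted and may appear before or after the name attribute.

```php
<?php
/**
 * Hypothetical helper: returns the keywords from <meta name="keywords" ...>
 * as an array, or an empty array when the tag or its content is missing.
 */
function extract_keywords(string $html): array
{
    // Step 1: find the meta tag whose name attribute is "keywords".
    $tagPattern = '/<meta\b[^>]*\bname\s*=\s*(["\']?)keywords\1[^>]*>/i';
    if (preg_match($tagPattern, $html, $tag) !== 1) {
        return [];
    }
    // Step 2: read the quoted content attribute from that tag, wherever it sits.
    if (preg_match('/\bcontent\s*=\s*(["\'])(.*?)\1/is', $tag[0], $m) !== 1) {
        return [];
    }
    // Step 3: split on commas, trim each item, and drop empty items.
    $parts = array_map('trim', explode(',', $m[2]));
    return array_values(array_filter($parts, 'strlen'));
}

$html = '<meta content="HTML, headings ,metadata" name="keywords">';
print_r(extract_keywords($html));
```

Splitting the work into two small patterns keeps each one readable and avoids depending on the order of the attributes.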

4. Summary

This article introduced how to use PHP regular expressions to match different types of titles in HTML pages. By using functions such as preg_match(), preg_match_all(), and preg_replace(), combined with the syntax and rules of regular expressions, we can easily extract relevant information from HTML code for subsequent processing and analysis.
