PHP Regular Expressions: How to match all tables in HTML

WBOY
2023-06-23 09:21:32

When processing HTML pages in PHP, you may need to extract every table on the page. Regular expressions are one way to do this. This article shows how to use PHP regular expressions to match all tables in an HTML document.

1. Understand the structure of tables in HTML

When using regular expressions to match tables in HTML, we first need to understand the structure of tables in HTML. A basic HTML table usually contains the following parts:

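<table>        <!-- table start tag -->
    <caption>Table caption</caption>     <!-- table caption -->
    <thead>      <!-- header section start tag -->
        <tr>       <!-- header row start tag -->
            <th>Column 1</th>       <!-- first header cell -->
            <th>Column 2</th>       <!-- second header cell -->
            ...
        </tr>       <!-- header row end tag -->
    </thead>     <!-- header section end tag -->
    <tbody>      <!-- body section start tag -->
        <tr>       <!-- row start tag -->
            <td>Data 1</td>       <!-- first column data -->
            <td>Data 2</td>       <!-- second column data -->
            ...
        </tr>       <!-- row end tag -->
        ...
    </tbody>    <!-- body section end tag -->
    <tfoot>      <!-- footer section start tag -->
        <tr>       <!-- footer row start tag -->
            <td>Summary data</td>    <!-- first footer cell -->
            <td>Summary data</td>    <!-- second footer cell -->
            ...
        </tr>       <!-- footer row end tag -->
    </tfoot>     <!-- footer section end tag -->
</table>       <!-- table end tag -->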

2. Use PHP regular expressions to match tables in HTML

With the HTML table structure in mind, we can use PHP regular expressions to match each table as a whole. The steps are as follows:

  1. Use the PHP file_get_contents() function to fetch the source code of the HTML page and store it in a string variable.
$url = 'http://www.example.com/';     // URL of the HTML page
$html = file_get_contents($url);      // fetch the page source (returns false on failure)
  2. Use a regular expression to match all tables in the HTML and store them in an array variable.
preg_match_all('/<table[^>]*>(.*?)<\/table>/is', $html, $table_arr);

In the code above, /<table[^>]*>(.*?)<\/table>/is is the regular expression that matches HTML tables. In it, <table[^>]*> matches the <table> start tag together with any attributes; (.*?) lazily matches everything in between, with * meaning zero or more of the preceding character and ? making the match stop as early as possible; <\/table> matches the </table> end tag, with the slash escaped because / is also the pattern delimiter. In the trailing /is, the i modifier makes the match case-insensitive and the s modifier lets . match any character, including newlines.
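The pattern described above can be tried on a small inline document. The following is only a sketch: the sample markup is made up for illustration, and # is used as the delimiter (an assumption, not the article's choice) so that the slash in </table> needs no escaping.

```php
<?php
// Hypothetical sample document; a real page would come from file_get_contents().
$html = <<<HTML
<p>Intro</p>
<TABLE class="a">
  <tr><td>1</td></tr>
</TABLE>
<table><tr><td>2</td></tr></table>
HTML;

// '#' is the delimiter here, so the '/' in </table> can stay unescaped.
$count = preg_match_all('#<table[^>]*>(.*?)</table>#is', $html, $table_arr);

if ($count === false) {
    // preg_match_all() returns false when the pattern could not be executed
    throw new RuntimeException('Regex error code: ' . preg_last_error());
}

// $table_arr[0] holds each full <table>...</table> match,
// $table_arr[1] holds only the inner HTML captured by (.*?).
echo $count, " tables found\n";
foreach ($table_arr[1] as $i => $inner) {
    echo $i, ': ', trim($inner), "\n";
}
```

Because of the i modifier, the upper-case <TABLE> is matched as well, and because of the s modifier, the first table is matched even though it spans several lines.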

  3. Loop over $table_arr, take the HTML of each table, and parse out the individual data items.
$table_data = array();
foreach ($table_arr[0] as $table_html) {
    // Split the table into its header, body, and footer sections
    $found = preg_match_all('/<thead[^>]*>(.*?)<\/thead>.*?<tbody[^>]*>(.*?)<\/tbody>.*?<tfoot[^>]*>(.*?)<\/tfoot>/is', $table_html, $table_content);
    if (!$found) {
        continue;     // skip tables without <thead>, <tbody>, and <tfoot> in this order
    }

    // Header data
    $thead_html = $table_content[1][0];       // HTML of the header section
    preg_match_all('/<th[^>]*>(.*?)<\/th>/is', $thead_html, $thead);      // match the header cells

    // Body data
    $tbody_data = array();                    // reset so rows do not accumulate across tables
    $tbody_html = $table_content[2][0];       // HTML of the body section
    preg_match_all('/<tr[^>]*>(.*?)<\/tr>/is', $tbody_html, $tbody_rows);     // match each row
    foreach ($tbody_rows[1] as $tbody_row_html) {
        preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $tbody_row_html, $tbody_row);      // match each cell
        $tbody_data[] = $tbody_row[1];     // append the row's cells to the body data
    }

    // Footer data
    $tfoot_html = $table_content[3][0];       // HTML of the footer section
    preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $tfoot_html, $tfoot);      // match the footer cells
    $tfoot_data = $tfoot[1];

    // Collect the data of this table in one array
    $table_data[] = array(
        'thead'     => $thead[1],
        'tbody'     => $tbody_data,
        'tfoot'     => $tfoot_data
    );
}

In the above code, one regular expression splits each table into its header, body, and footer, and further regular expressions extract the cells from each part. Because a table body can contain any number of rows, a foreach loop processes it row by row. Note that the section pattern expects <thead>, <tbody>, and <tfoot> to all be present and in that order, so it does not match tables that lack any of them.
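The same parsing can be wrapped in a reusable function that also tolerates tables without <thead>, <tbody>, or <tfoot>. This is a minimal sketch, not the article's own code: the function names match_cells, match_section, and extract_tables are hypothetical, the sample markup is invented, and # is used as the pattern delimiter.

```php
<?php
// Hypothetical helper: inner HTML of every <$tag>...</$tag> element in $html.
function match_cells(string $html, string $tag): array
{
    // \b stops "th" from also matching "<thead>"
    preg_match_all('#<' . $tag . '\b[^>]*>(.*?)</' . $tag . '>#is', $html, $m);
    return array_map('trim', $m[1]);
}

// Hypothetical helper: inner HTML of the first <$tag> section, or '' if absent.
function match_section(string $html, string $tag): string
{
    $pattern = '#<' . $tag . '\b[^>]*>(.*?)</' . $tag . '>#is';
    return preg_match($pattern, $html, $m) ? $m[1] : '';
}

// Hypothetical helper: one array per table with 'thead', 'tbody', 'tfoot' keys.
function extract_tables(string $html): array
{
    $tables = [];
    preg_match_all('#<table\b[^>]*>(.*?)</table>#is', $html, $table_arr);
    foreach ($table_arr[1] as $inner) {
        // Everything left after removing <thead> and <tfoot> is treated as the
        // body, so tables with or without an explicit <tbody> are handled alike.
        $body_html = preg_replace('#<(thead|tfoot)\b[^>]*>.*?</\1>#is', '', $inner) ?? $inner;

        $rows = [];
        foreach (match_cells($body_html, 'tr') as $row_html) {
            $rows[] = match_cells($row_html, 'td');
        }

        $tables[] = [
            'thead' => match_cells(match_section($inner, 'thead'), 'th'),
            'tbody' => $rows,
            'tfoot' => match_cells(match_section($inner, 'tfoot'), 'td'),
        ];
    }
    return $tables;
}

// Made-up input: one full table and one table with rows only.
$html = '<table><thead><tr><th>Name</th><th>Qty</th></tr></thead>'
      . '<tbody><tr><td>Apple</td><td>3</td></tr><tr><td>Pear</td><td>5</td></tr></tbody>'
      . '<tfoot><tr><td>Total</td><td>8</td></tr></tfoot></table>'
      . '<table><tr><td>only</td></tr></table>';

$tables = extract_tables($html);
print_r($tables);
```

Missing sections simply produce empty arrays here instead of causing the whole table to be skipped. Nested tables are still not handled by this sketch.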

3. Summary

With the steps above, we can use PHP regular expressions to match all tables in an HTML document and store their data in arrays. Because real-world table markup varies widely (nested tables, omitted sections, and so on), regular expressions may not extract the data accurately in every case, and the patterns need to be adjusted to the pages you are actually processing.
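One such inaccuracy is easy to reproduce with nested tables. The snippet below is an illustrative sketch on invented markup (the id values are hypothetical): the lazy (.*?) stops at the first </table> it meets, which belongs to the inner table.

```php
<?php
// Hypothetical input: an outer table whose cell contains another table.
$html = '<table id="outer"><tr><td>'
      . '<table id="inner"><tr><td>x</td></tr></table>'
      . '</td></tr></table>';

preg_match_all('#<table[^>]*>(.*?)</table>#is', $html, $m);

// Only one match is found: it starts at the outer start tag but ends at the
// inner table's end tag, so the outer table is cut off early.
echo count($m[0]), "\n";
echo $m[0][0], "\n";
```

The inner table is never reported as a match of its own, and the outer match is incomplete, so pages with nested tables need a different pattern or extra handling.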
