PHP Regular Expressions: How to match all tables in HTML
When using PHP to process HTML pages, if you need to obtain all table data from the page, you can use regular expressions to achieve this. This article will show you how to use PHP regular expressions to match all tables in HTML.
1. Understand the structure of tables in HTML
When using regular expressions to match tables in HTML, we first need to understand the structure of tables in HTML. A basic HTML table usually contains the following parts:
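<table> <!-- table start tag -->
  <caption>Table title</caption> <!-- table caption -->
  <thead> <!-- header section start tag -->
    <tr> <!-- header row start tag -->
      <th>Column name 1</th> <!-- first header cell -->
      <th>Column name 2</th> <!-- second header cell -->
      ...
    </tr> <!-- header row end tag -->
  </thead> <!-- header section end tag -->
  <tbody> <!-- body section start tag -->
    <tr> <!-- row start tag -->
      <td>Data 1</td> <!-- first column data -->
      <td>Data 2</td> <!-- second column data -->
      ...
    </tr> <!-- row end tag -->
    ...
  </tbody> <!-- body section end tag -->
  <tfoot> <!-- footer section start tag -->
    <tr> <!-- footer row start tag -->
      <td>Summary data</td> <!-- first footer cell -->
      <td>Summary data</td> <!-- second footer cell -->
      ...
    </tr> <!-- footer row end tag -->
  </tfoot> <!-- footer section end tag -->
</table> <!-- table end tag -->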
2. Use PHP regular expressions to match tables in HTML
With an understanding of the HTML table structure, we can use PHP regular expressions to match each table as a whole. The specific steps are as follows:
First, use the file_get_contents() function to get the source code of the HTML page and save it in a string variable.
$url = 'http://www.example.com/'; // URL of the HTML page
$html = file_get_contents($url); // Get the source code of the HTML page
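Since file_get_contents() returns false when the page cannot be read, it can help to check the result before running any pattern against it. The following is a minimal sketch under that assumption; the helper name fetchHtml() is hypothetical and not part of the article, and the demo reads a local temporary file so that it runs without network access.

```php
<?php
// Hypothetical helper (not from the article): read a page or fail loudly.
function fetchHtml(string $source): string
{
    // file_get_contents() returns false on failure; @ hides the warning
    // because the failure is reported through the exception instead.
    $html = @file_get_contents($source);
    if ($html === false) {
        throw new RuntimeException("Could not read HTML from {$source}");
    }
    return $html;
}

// Demo with a temporary local file instead of a real URL.
$tmp = tempnam(sys_get_temp_dir(), 'tbl');
file_put_contents($tmp, '<table><tr><td>1</td></tr></table>');
$html = fetchHtml($tmp);
unlink($tmp);

echo strlen($html), " bytes read\n";
```

The same helper accepts a URL such as the one above, provided the PHP configuration allows URL wrappers.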
Next, use preg_match_all() to match every table in the page and store the matches in $table_arr:
preg_match_all('/<table[^>]*>(.*?)<\/table>/is', $html, $table_arr);
In the above call, /<table[^>]*>(.*?)<\/table>/is is the regular expression that matches HTML tables. In it, <table[^>]*> matches the <table> start tag together with any attributes; (.*?) captures everything in between; and <\/table> matches the </table> end tag, with the slash escaped because / is also the pattern delimiter. Of the trailing modifiers, i makes the match case-insensitive and s lets . match any character, including newlines. The * means zero or more of the preceding element, and the ? after it makes the match lazy, so it stops at the first </table> it finds.
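To see what the table pattern returns, here is a small self-contained sketch. The sample markup is invented for the demo, and the slash in the closing tag is escaped so that it does not end the pattern early. It also shows why the s modifier matters for tables that span several lines.

```php
<?php
// Sample markup invented for this demo: two tables, one spanning several lines.
$html = <<<HTML
<p>intro</p>
<TABLE class="a">
  <tr><td>1</td></tr>
</TABLE>
<table><tr><td>2</td></tr></table>
HTML;

// i: tag names match in any letter case; s: "." also matches newlines.
$count = preg_match_all('/<table[^>]*>(.*?)<\/table>/is', $html, $table_arr);

// $table_arr[0] holds each full <table>...</table> match,
// $table_arr[1] holds only the inner HTML captured by (.*?).
echo $count, " tables found\n";
echo trim($table_arr[1][0]), "\n";

// Without the s modifier, "." cannot cross a line break,
// so the multi-line table is not matched.
$countNoS = preg_match_all('/<table[^>]*>(.*?)<\/table>/i', $html, $ignored);
echo $countNoS, " table(s) found without the s modifier\n";
```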
Finally, loop over $table_arr to obtain the contents of each table and parse out each data item.
$table_data = array();
foreach ($table_arr[0] as $table_html) {
    // Extract the header, body, and footer sections of the table
    if (!preg_match_all('/<thead[^>]*>(.*?)<\/thead>.*?<tbody[^>]*>(.*?)<\/tbody>.*?<tfoot[^>]*>(.*?)<\/tfoot>/is', $table_html, $table_content)) {
        continue; // skip tables that do not contain all three sections in this order
    }

    // Header data
    $thead_html = $table_content[1][0]; // HTML of the header section
    preg_match_all('/<th[^>]*>(.*?)<\/th>/is', $thead_html, $thead); // match the header cells

    // Body data
    $tbody_data = array(); // reset for each table so rows from earlier tables do not accumulate
    $tbody_html = $table_content[2][0]; // HTML of the body section
    preg_match_all('/<tr[^>]*>(.*?)<\/tr>/is', $tbody_html, $tbody_rows); // match each row
    foreach ($tbody_rows[1] as $tbody_row_html) {
        preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $tbody_row_html, $tbody_row); // match each cell
        $tbody_data[] = $tbody_row[1]; // append the row to the body data array
    }

    // Footer data
    $tfoot_html = $table_content[3][0]; // HTML of the footer section
    preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $tfoot_html, $tfoot); // match the footer cells
    $tfoot_data = $tfoot[1];

    // Save the data of this table as one array entry
    $table_data[] = array(
        'thead' => $thead[1],
        'tbody' => $tbody_data,
        'tfoot' => $tfoot_data
    );
}
In the above code, one regular expression first splits each table into its header, body, and footer, and further regular expressions then extract the cells from each part. Because the number of body rows differs from table to table, a foreach loop processes the body row by row; the header and footer cells are each collected with a single preg_match_all() call.
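The parsing step can also be packaged as a function and tried on sample data. The sketch below is one possible arrangement, not the article's own code: the helper names sectionHtml(), cellTexts(), and parseTables() are hypothetical, the sample markup is invented, and PHP 7.4 or later is assumed for the arrow function. It matches each section separately, so a table without a thead or tfoot simply yields an empty array for that part.

```php
<?php
// Hypothetical helper: inner HTML of the first <$tag> section, or '' if absent.
function sectionHtml(string $tag, string $tableHtml): string
{
    $pattern = '/<' . $tag . '\b[^>]*>(.*?)<\/' . $tag . '>/is';
    return preg_match($pattern, $tableHtml, $m) ? $m[1] : '';
}

// Hypothetical helper: trimmed, tag-free text of every <$tag> cell in $html.
function cellTexts(string $tag, string $html): array
{
    preg_match_all('/<' . $tag . '\b[^>]*>(.*?)<\/' . $tag . '>/is', $html, $m);
    return array_map(fn ($cell) => trim(strip_tags($cell)), $m[1]);
}

// Hypothetical wrapper around the steps described above.
function parseTables(string $html): array
{
    $tables = [];
    preg_match_all('/<table\b[^>]*>(.*?)<\/table>/is', $html, $found);
    foreach ($found[1] as $tableHtml) {
        $rows = []; // start with an empty row list for every table
        preg_match_all('/<tr\b[^>]*>(.*?)<\/tr>/is', sectionHtml('tbody', $tableHtml), $tr);
        foreach ($tr[1] as $rowHtml) {
            $rows[] = cellTexts('td', $rowHtml);
        }
        $tables[] = [
            'thead' => cellTexts('th', sectionHtml('thead', $tableHtml)),
            'tbody' => $rows,
            'tfoot' => cellTexts('td', sectionHtml('tfoot', $tableHtml)),
        ];
    }
    return $tables;
}

// Invented sample: one complete table and one with only a body.
$html = '<table><thead><tr><th>Name</th><th>Qty</th></tr></thead>'
      . '<tbody><tr><td>Apple</td><td>3</td></tr><tr><td>Pear</td><td>5</td></tr></tbody>'
      . '<tfoot><tr><td>Total</td><td>8</td></tr></tfoot></table>'
      . '<table><tbody><tr><td>only</td></tr></tbody></table>';

$tables = parseTables($html);
print_r($tables);
```

The \b after each tag name keeps a pattern such as <th from also matching <thead.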
3. Summary
Through the above steps, we can use PHP regular expressions to match all tables in HTML and save the data in array variables. Of course, HTML table structure can be complex: tables may be nested, or may omit the thead, tbody, or tfoot sections. In such cases these patterns can return incomplete or incorrect results, so they need to be adjusted to the actual markup.
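Nested tables are one case where such an adjustment is needed, because a lazy match ends at the first closing table tag. The sketch below illustrates the problem and one possible adjustment using PCRE's recursive (?R) construct. The sample markup is invented, and the recursive pattern is offered as an illustration under the assumption of well-formed, properly closed table tags, not as a general HTML parser. It uses ~ as the delimiter so the slashes need no escaping.

```php
<?php
// Invented sample: an outer table whose cell contains an inner table.
$html = '<table id="outer"><tr><td>'
      . '<table id="inner"><tr><td>x</td></tr></table>'
      . '</td></tr></table>';

// The lazy pattern stops at the first </table>, cutting the outer table short.
preg_match_all('/<table[^>]*>(.*?)<\/table>/is', $html, $lazy);
$truncated = $lazy[0][0] !== $html;

// Recursive sketch: (?R) re-enters the whole pattern for each nested <table>,
// and the negative lookahead stops plain content from swallowing table tags.
$recursive = '~<table\b[^>]*>(?:(?:(?!</?table\b).)++|(?R))*+</table>~is';
preg_match_all($recursive, $html, $balanced);

var_dump($truncated);                // did the lazy match lose the tail?
var_dump($balanced[0][0] === $html); // did the recursive match cover the whole outer table?
```

With this pattern the inner table is part of the outer match rather than a separate result, so it has to be extracted in a second pass over the outer table's content if it is needed on its own.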