Python Program to Extract Strings Between HTML Tags

WBOY

Aug 19, 2023, 09:37 AM

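HTML tags are used to build the structure of a web page. We convey information and publish content as strings enclosed in these tags. The strings between HTML tags are the content that the browser displays and interprets. Extracting these strings is therefore an important step in data manipulation and processing, and it helps us analyze and understand the structure of an HTML document.

These strings reveal the patterns and logic behind how a web page is built. In this article, we work with these strings: our task is to extract the strings between HTML tags.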

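Understanding the Problem

We need to extract all the strings that sit between HTML tags. The target strings are surrounded by different kinds of tags, and only the content should be retrieved. Let's look at an example to understand the problem.

Input/Output Scenario

Let's consider a string -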

Input:
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"

The input string contains several HTML tags, and we need to extract the strings between them.

Output: [" This is a test string,  Let's code together "]

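As we can see, the "<h1>" and "<p>" tags have been removed and the strings have been extracted. Now that we understand the problem, let's discuss a few solutions.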

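Using Iteration and replace()

This approach focuses on removing the HTML tags by replacing them. We start with a string and a list of different HTML tags, then store the string as the single element of a list.

We iterate over every element of the tag list and check whether it appears in the string. A variable "pos" holds the index of the list element being processed.

We use the "replace()" method to replace each tag with a space, which leaves a string without HTML tags.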

Example

The following example extracts the strings between HTML tags -

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
# Opening and closing tags to strip from the string
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
print(f"This is the original string: {Inp_STR}")
ExStr = [Inp_STR]  # the string is held as the only element of a list
pos = 0            # index of that element

for tag in tags:
   if tag in ExStr[pos]:
      # Replace each tag that is present with a single space
      ExStr[pos] = ExStr[pos].replace(tag, " ")

print(f"The extracted string is : {ExStr}")

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is : [" This is a test string,  Let's code together "]
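
The result above is a single string that still carries the spaces left behind by each removed tag. As a minimal sketch, the same replace() loop can be wrapped in a function that also collapses the leftover whitespace. The function name strip_tags and the whitespace cleanup are illustrative assumptions, not part of the original program.

```python
def strip_tags(text, tags):
    """Replace each listed tag with a space, then collapse extra whitespace.

    strip_tags is a hypothetical helper name used only for this sketch.
    """
    for tag in tags:
        text = text.replace(tag, " ")
    # split() with no arguments drops leading, trailing and repeated whitespace
    return " ".join(text.split())


html = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
cleaned = strip_tags(html, tags)
print(cleaned)
```

Note that literal replacement only removes tags spelled exactly as they appear in the list, so an opening tag that carries attributes, such as <h1 class="x">, would be left in place.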

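Using the Regular Expression Module's findall()

In this approach, we use the regular expression module to match a specific pattern. We build the regular expression "<" + tag + ">(.*?)</" + tag + ">", which represents the target pattern. It matches an opening tag and its closing tag and captures the text between them. Here, "tag" is a variable that takes its value from the tag list on each iteration.

The "findall()" function finds all matches of the pattern in the original string. We use the "extend()" method to add all the matches to a new list. In this way, we extract the strings enclosed by the HTML tags.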
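
To see why the pattern uses the non-greedy (.*?) rather than (.*), the short sketch below runs both forms against a string with two <p> elements. The sample string is made up for illustration.

```python
import re

sample = "<p>first</p><p>second</p>"

# Non-greedy: stops at the nearest closing tag, so each element is matched separately
lazy = re.findall("<p>(.*?)</p>", sample)

# Greedy: runs to the last closing tag, swallowing the markup in between
greedy = re.findall("<p>(.*)</p>", sample)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first</p><p>second']
```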

Example

The following is an example -

import re

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]
print(f"This is the original string: {Inp_STR}")
ExStr = []

for tag in tags:
   # Build a pattern such as <h1>(.*?)</h1>; (.*?) lazily captures the text between the pair
   seq = "<"+tag+">(.*?)</"+tag+">"
   matches = re.findall(seq, Inp_STR)
   ExStr.extend(matches)
print(f"The extracted string is: {ExStr}")

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
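
The pattern in this example matches only bare tags such as <h1>, while real pages often carry attributes on the opening tag. One possible extension, sketched here as an assumption rather than part of the original program, allows optional attributes with \b[^>]*, escapes the tag name with re.escape(), and passes re.DOTALL so the content may span several lines. The function name extract_between and the sample string are hypothetical.

```python
import re


def extract_between(html, tags):
    """Collect the text between each listed tag pair, tolerating attributes.

    extract_between is a hypothetical helper name used only for this sketch.
    """
    found = []
    for tag in tags:
        name = re.escape(tag)
        # \b keeps "p" from matching "<pre>"; [^>]* skips any attributes
        pattern = rf"<{name}\b[^>]*>(.*?)</{name}>"
        found.extend(re.findall(pattern, html, flags=re.DOTALL))
    return found


html = '<h1 id="title">This is a test string,</h1><p class="x">Let\'s code together</p>'
result = extract_between(html, ["h1", "p", "b", "br"])
print(result)
```

This remains a pattern match, not an HTML parser: nested elements with the same tag name, for example, are not handled.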

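Using Iteration and find()

In this approach, we use the "find()" method to locate the first occurrence of the opening and closing tags in the original string. We iterate over every element of the tag list and retrieve its position in the string.

A while loop keeps searching the string for HTML tags. A condition checks for an incomplete tag, that is, an opening tag with no matching closing tag, and stops the search if one is found. On each iteration, the index values are updated to find the next occurrence of the opening and closing tags.

Each time the positions of an opening tag and its closing tag are found, we use string slicing to extract the text between them.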

Example

The following is an example -

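Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]
ExStr = []
print(f"The original string is: {Inp_STR}")

for tag in tags:
   # Position of the first opening tag, or -1 if the tag is absent
   tagpos1 = Inp_STR.find("<"+tag+">")
   while tagpos1 != -1:
      # Look for the matching closing tag after the opening tag
      tagpos2 = Inp_STR.find("</"+tag+">", tagpos1)
      if tagpos2 == -1:
         break  # incomplete tag: no closing tag found
      # len(tag)+2 skips past the opening tag, including "<" and ">"
      ExStr.append(Inp_STR[tagpos1 + len(tag)+2: tagpos2])
      # Continue with the next opening tag after this closing tag
      tagpos1 = Inp_STR.find("<"+tag+">", tagpos2)

print(f"The extracted string is: {ExStr}")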

Output

The original string is: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
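
Because the while loop resumes the search after each closing tag, the find() approach also picks up repeated occurrences of the same tag. The sketch below wraps the same logic in a function and runs it on a string with two <p> elements. The name extract_with_find and the sample input are assumptions for illustration.

```python
def extract_with_find(text, tags):
    """Slice out the text between each <tag>...</tag> pair using str.find().

    extract_with_find is a hypothetical helper name used only for this sketch.
    """
    found = []
    for tag in tags:
        open_tag, close_tag = f"<{tag}>", f"</{tag}>"
        start = text.find(open_tag)
        while start != -1:
            end = text.find(close_tag, start)
            if end == -1:
                break  # opening tag without a closing tag: stop searching
            found.append(text[start + len(open_tag):end])
            # resume the search after the closing tag just consumed
            start = text.find(open_tag, end + len(close_tag))
    return found


sample = "<p>one</p><b>bold</b><p>two</p>"
result = extract_with_find(sample, ["p", "b"])
print(result)  # ['one', 'two', 'bold']
```

The results are grouped by tag in the order of the tag list, not by their position in the document, which is why 'bold' comes last here.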