Python program to extract strings between HTML tags
HTML tags define the structure of a web page. The content of each element is written as a string between an opening tag and its closing tag, and the tags tell the browser how to display and interpret that content. Extracting these strings is therefore a common step in data manipulation and processing, for example when analyzing the structure of an HTML document.
In this article, we will work with these strings and look at several ways to pull them out of their surrounding tags.
We need to extract all strings between HTML tags. Our target string is surrounded by different types of tags and only the content part should be retrieved. Let us understand this problem through an example.
Let us consider a string -
Input: Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
The input string consists of different HTML tags, and we need to extract the string between them.
Output: [" This is a test string, Let's code together "]
As we can see, the "<h1>" and "<p>" tags are removed and the string is extracted. Now that we understand the problem, let's discuss a few solutions.
The first method removes the HTML tags by replacing them. We start with the input string and a list of the HTML tags to remove, and store the string as the only element of a new list.
We then loop through each element of the tag list and check whether it occurs in the string. A "pos" variable holds the index of the list element being processed, which is 0 here because the list contains a single string.
We use the "replace()" method to replace each tag with a space, which leaves a string without HTML tags.
The following is an example of extracting strings between HTML tags -
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
# Opening and closing tags to strip from the string
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
print(f"This is the original string: {Inp_STR}")
ExStr = [Inp_STR]
pos = 0  # index of the only element in ExStr
for tag in tags:
    if tag in ExStr[pos]:
        # Replace every occurrence of the tag with a space
        ExStr[pos] = ExStr[pos].replace(tag, " ")
print(f"The extracted string is : {ExStr}")
Output -
This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is : [" This is a test string,  Let's code together "]
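Because every tag is replaced with a space, the result of the replace() approach keeps leading, trailing, and doubled spaces. A minimal sketch of one way to tidy this up, assuming a single cleaned string is wanted, is shown below; "strip_tags" is a hypothetical helper name, not part of the example above.

```python
def strip_tags(text, tags):
    """Replace each listed tag with a space, then collapse extra whitespace."""
    for tag in tags:
        text = text.replace(tag, " ")
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing blanks, so joining with one space normalizes the result.
    return " ".join(text.split())


Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]

cleaned = strip_tags(Inp_STR, tags)
print(cleaned)  # This is a test string, Let's code together
```

Tags that are not in the list are left in place, so the tag list has to cover every tag that may appear in the input.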
In this method, we use the regular expression module "re" to match a specific pattern. The pattern has the form "<tag>(.*?)</tag>" and is built in code as "<"+tag+">(.*?)</"+tag+">". It matches an opening tag, lazily captures the text that follows, and stops at the matching closing tag. Here, "tag" is a variable whose value is taken from the tag list on each iteration.
The "findall()" function finds all occurrences of the pattern in the input string. Because the pattern contains one capturing group, it returns only the captured text. We use the "extend()" method to add all "matches" to a new list. In this way, we extract the strings contained in the HTML tags.
The following is an example -
import re

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]
print(f"This is the original string: {Inp_STR}")
ExStr = []
for tag in tags:
    # Lazily capture the text between <tag> and </tag>
    seq = "<" + tag + ">(.*?)</" + tag + ">"
    matches = re.findall(seq, Inp_STR)
    ExStr.extend(matches)
print(f"The extracted string is: {ExStr}")
Output -
This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
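The regular expression approach depends on a predefined tag list. As a hedged variation, the sketch below uses a backreference so that any simple, non-nested tag pair is matched without listing tag names. The pattern and the names "pattern", "pairs", and "contents" are illustrative assumptions, and the sketch is not a general HTML parser: nested tags of the same name will not be handled correctly.

```python
import re

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"

# (\w+) captures the tag name, [^>]* skips optional attributes,
# (.*?) lazily captures the content, and \1 requires the same tag name to close.
pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

# With two capturing groups, findall() returns a list of (tag, content) tuples.
pairs = pattern.findall(Inp_STR)
contents = [content for _tag, content in pairs]

print(pairs)     # [('h1', 'This is a test string,'), ('p', "Let's code together")]
print(contents)  # ['This is a test string,', "Let's code together"]
```

Keeping the tag name next to each extracted string can be useful when the content of headings and paragraphs needs to be treated differently.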
In this method, we use the "find()" method to get the first occurrence of the opening and closing tags in the original string. We iterate through each element in the tag list and retrieve its position in the string.
A while loop keeps searching for further occurrences of the same tag. If an opening tag has no closing tag, "find()" returns -1 and the loop stops. On each iteration, the index value is updated so that the search continues after the last closing tag.
With the index values of an opening tag and its closing tag, we use string slicing to extract the string between them and append it to the result list.
The following is an example -
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]
ExStr = []
print(f"The original string is: {Inp_STR}")
for tag in tags:
    tagpos1 = Inp_STR.find("<" + tag + ">")
    while tagpos1 != -1:
        tagpos2 = Inp_STR.find("</" + tag + ">", tagpos1)
        if tagpos2 == -1:
            break  # opening tag without a closing tag
        # Skip past "<tag>" (len(tag) + 2 characters) and slice up to the closing tag
        ExStr.append(Inp_STR[tagpos1 + len(tag) + 2: tagpos2])
        tagpos1 = Inp_STR.find("<" + tag + ">", tagpos2)
print(f"The extracted string is: {ExStr}")
Output -
The original string is: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
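All three methods treat HTML as plain text, so they can miss tags with attributes or nested elements. As a point of comparison, and not one of the methods described in this article, the sketch below uses the standard library's "html.parser.HTMLParser" to collect the text between tags. "TextCollector" is a hypothetical class name chosen for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every non-empty piece of text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data() is called by the parser with the text found between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"

collector = TextCollector()
collector.feed(Inp_STR)
collector.close()

print(collector.chunks)  # ['This is a test string,', "Let's code together"]
```

This version needs no tag list, and text inside nested elements is reported as separate entries.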
In this article, we discussed three ways to extract strings between HTML tags. We started with a simple solution that locates the tags and replaces them with spaces. We then used the regular expression module and its findall() function to find matching patterns. Finally, we used the find() method together with string slicing.