# Extracting Text from HTML Content in Python: A Simple Solution with `HTMLParser`
When working with HTML data, you often need to clean up the tags and retain only the plain text. Whether it's for data analysis, automation, or simply making content readable, this task is common for developers.
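In this article, I'll show you how to create a simple Python class that extracts plain text from HTML using `HTMLParser`, a class from the built-in `html.parser` module.
`HTMLParser` ships with Python's standard library, so there is nothing to install. It is an event-driven parser: you subclass it and override handler methods such as `handle_starttag` and `handle_data`, and it calls them as it encounters tags and text. Unlike external libraries such as BeautifulSoup, it adds no dependencies, which makes it ideal for simple tasks like stripping HTML tags.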
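To make the callback flow concrete, here is a minimal sketch that records each event the parser fires for a small snippet. The `EventLogger` class and the sample markup are illustrative names chosen for this example; only `HTMLParser` and its handler methods come from the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every parser callback so the event order is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello <b>world</b></p>")
logger.close()  # flush any text the parser is still buffering

for event in logger.events:
    print(event)
# ('start', 'p')
# ('data', 'Hello ')
# ('start', 'b')
# ('data', 'world')
# ('end', 'b')
# ('end', 'p')
```

A text extractor only needs the `data` events, which is why the class in this article overrides little more than `handle_data`.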
```python
from html.parser import HTMLParser


class HTMLTextExtractor(HTMLParser):
    """Class for extracting plain text from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        # Keep each chunk as-is; whitespace is normalized in get_text().
        self.text.append(data)

    def get_text(self):
        # Join the chunks, then collapse runs of whitespace into single spaces.
        return ' '.join(''.join(self.text).split())
```

Here's how you can use the class to clean up HTML:

```python
raw_description = """
<div>
    <h1>Welcome to our website!</h1>
    <p>We offer <strong>exceptional services</strong> for our customers.</p>
    <p>Contact us at: <a href="mailto:contact@example.com">contact@example.com</a></p>
</div>
"""

extractor = HTMLTextExtractor()
extractor.feed(raw_description)
extractor.close()  # flush any text still buffered by the parser
description = extractor.get_text()
print(description)
```

Output:

```
Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com
```
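One caveat worth knowing: `HTMLParser` also passes the contents of `<script>` and `<style>` elements to `handle_data`, so real web pages can leak CSS and JavaScript into the extracted text. The following is a minimal sketch of one way to skip them. The class name `VisibleTextExtractor`, the `SKIPPED_TAGS` set, and the sample page are assumptions made for this example, not part of the original class.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Extract text while ignoring <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}  # hypothetical choice of tags to drop

    def __init__(self):
        super().__init__()
        self.text = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.text.append(data)

    def get_text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self.text).split())


page = "<style>p { color: red; }</style><p>Hello</p><script>alert('hi');</script>"
visible = VisibleTextExtractor()
visible.feed(page)
visible.close()
print(visible.get_text())  # Hello
```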
If you want to capture additional information, such as the link targets in `<a>` tags, here's an enhanced version of the class:
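```python
class HTMLTextExtractor(HTMLParser):
    """Class for extracting plain text and links from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.current_href = None  # href of the <a> tag being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Remember the target; it is appended after the link text.
            self.current_href = dict(attrs).get('href')

    def handle_data(self, data):
        self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.current_href:
            self.text.append(f" (link: {self.current_href})")
            self.current_href = None

    def get_text(self):
        # Join the chunks, then collapse runs of whitespace into single spaces.
        return ' '.join(''.join(self.text).split())
```

Enhanced Output:

```
Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com (link: mailto:contact@example.com)
```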
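If you would rather keep links separate from the text instead of embedding them inline, a variation on the same idea is to gather them into a list. This is a sketch under that assumption; `LinkCollector` and its attributes are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # target of the <a> tag currently open, if any
        self._parts = []    # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join("".join(self._parts).split())
            self.links.append((label, self._href))
            self._href = None


collector = LinkCollector()
collector.feed(
    '<p>Contact us at: '
    '<a href="mailto:contact@example.com">contact@example.com</a></p>'
)
collector.close()
print(collector.links)
# [('contact@example.com', 'mailto:contact@example.com')]
```

Anchors without an `href` attribute are ignored in this sketch, which is a design choice rather than a requirement.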
## Use Cases

- **SEO**: Clean HTML tags to analyze the plain text content of a webpage.
- **Emails**: Transform HTML emails into plain text for basic email clients.
- **Scraping**: Extract important data from web pages for analysis or storage.
- **Automated Reports**: Simplify API responses containing HTML into readable text.
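For the email use case, a single line of text is usually not what you want, since paragraphs and headings should stay on separate lines. Here is a rough sketch of how the extractor could start a new line at block-level elements. The `BlockAwareTextExtractor` name, the `BLOCK_TAGS` subset, and the sample message are all made up for this example.

```python
import re
from html.parser import HTMLParser


class BlockAwareTextExtractor(HTMLParser):
    """Extract text, starting a new line at each block-level element."""

    # Illustrative subset of block-level tags, not an exhaustive list.
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "br", "tr"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Source newlines are not meaningful in HTML, so flatten them to spaces.
        self.parts.append(re.sub(r"\s+", " ", data))

    def get_text(self):
        # Tidy each line and drop the empty ones left by nested block tags.
        lines = (" ".join(line.split()) for line in "".join(self.parts).split("\n"))
        return "\n".join(line for line in lines if line)


html_email = (
    "<div><h1>Order shipped</h1>"
    "<p>Your order <b>#1234</b> is on its way.</p>"
    "<p>Thank you!</p></div>"
)
extractor = BlockAwareTextExtractor()
extractor.feed(html_email)
extractor.close()
print(extractor.get_text())
# Order shipped
# Your order #1234 is on its way.
# Thank you!
```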
## Limitations and Alternatives

While `HTMLParser` is simple and efficient, it has some limitations:

- **Complex HTML**: It may struggle with very complex or poorly formatted HTML documents.
- **Limited Features**: It doesn't provide advanced parsing features like CSS selectors or DOM tree manipulation.

### Alternatives

If you need more robust features, consider using these libraries:

- **BeautifulSoup**: Excellent for complex HTML parsing and manipulation.
- **lxml**: Known for its speed and support for both XML and HTML parsing.
With this solution, you can easily extract plain text from HTML in just a few lines of code. Whether you're working on a personal project or a professional task, this approach is perfect for lightweight HTML cleaning and analysis.
If your use case involves more complex or malformed HTML, consider using libraries like BeautifulSoup or lxml for enhanced functionality.
Feel free to try this code in your projects and share your experiences. Happy coding!