Extracting Text from HTML Content in Python: A Simple Solution with `HTMLParser`

Patricia Arquette

Dec 10, 2024, 11:04 AM


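## Introduction

When working with HTML data, you often need to strip the tags and keep only the plain text. Whether for data analysis, automation, or simply making content readable, this is a common task for developers.

In this article, I'll show you how to create a simple Python class that extracts plain text from HTML using `HTMLParser`, a class from Python's built-in `html.parser` module.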

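## Why Use HTMLParser?

`HTMLParser` is a class in `html.parser`, a module in Python's standard library, that lets you parse HTML documents by reacting to tags and text as they are encountered. Unlike external libraries such as BeautifulSoup, it requires no installation, which makes it a good fit for simple tasks like stripping HTML tags.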
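To make the callback model concrete, here is a minimal sketch that records which handler methods fire while a small snippet is parsed. The `EventLogger` class and the sample markup are illustrative assumptions, not part of the original article; only the `handle_*` methods, `feed()`, and `close()` come from `html.parser`.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record each parser callback as it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed('<p class="intro">Hello <b>world</b></p>')
logger.close()  # flush any buffered text
for event in logger.events:
    print(event)
```

Text extraction only needs `handle_data`, which is why the class built in this article is so short.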

## The Solution: A Simple Python Class

### Step 1: Create the HTMLTextExtractor Class

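from html.parser import HTMLParser


class HTMLTextExtractor(HTMLParser):
    """Extract plain text from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []  # text fragments, in document order

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

    def get_text(self):
        # Join with a space so text from adjacent tags does not run together.
        return ' '.join(self.text)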

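This class does three main things:

- Initializes a list, `self.text`, to store the extracted text.
- Uses the `handle_data` method to capture the plain text found between HTML tags.
- Uses the `get_text` method to combine all the text fragments.

### Step 2: Use the Class to Extract Text

Here is how to use the class to clean up HTML: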

raw_description = """
<div>
    <h1 id="Welcome-to-our-website">Welcome to our website!</h1>
    <p>We offer <strong>exceptional services</strong> for our customers.</p>
    <p>Contact us at: <a href="mailto:contact@example.com">contact@example.com</a></p>
</div>
"""

extractor = HTMLTextExtractor()
extractor.feed(raw_description)
description = extractor.get_text()

print(description)

Output:

Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com
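One thing to watch for on real pages: `handle_data` also receives the contents of `<script>` and `<style>` elements, so JavaScript and CSS can leak into the extracted text. The following is a minimal sketch of one way to skip them, assuming you only want visible text. The class name `VisibleTextExtractor` and the `SKIPPED_TAGS` set are illustrative choices, not part of the original article.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical variant: ignores text inside <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and self._skip_depth == 0:
            self.text.append(stripped)

    def get_text(self):
        return " ".join(self.text)


html = """
<style>p { color: red; }</style>
<p>Tom &amp; Jerry</p>
<script>console.log("hidden");</script>
"""

extractor = VisibleTextExtractor()
extractor.feed(html)
extractor.close()
print(extractor.get_text())  # prints: Tom & Jerry
```

Note that the `&amp;` entity comes back as `&` without extra work, because `HTMLParser` converts character references by default (`convert_charrefs=True`).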

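## Adding Support for Attributes

If you want to capture additional information, such as the links in `<a>` tags, here is an enhanced version of the class: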

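from html.parser import HTMLParser


class HTMLTextExtractor(HTMLParser):
    """Extract plain text and links from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []
        self._href = None  # href of the <a> tag currently open, if any

    def handle_data(self, data):
        # Skip whitespace-only runs so they do not add empty fragments.
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

    def handle_starttag(self, tag, attrs):
        # Remember the link target; it is emitted once the tag closes.
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        # Append the link after the link text rather than before it.
        if tag == 'a':
            if self._href:
                self.text.append(f"(link: {self._href})")
            self._href = None

    def get_text(self):
        return ' '.join(self.text)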

Enhanced output:

Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com (link: mailto:contact@example.com)

## Use Cases

- **SEO**: Clean HTML tags to analyze the plain text content of a webpage.
- **Emails**: Transform HTML emails into plain text for basic email clients.
- **Scraping**: Extract important data from web pages for analysis or storage.
- **Automated Reports**: Simplify API responses containing HTML into readable text.
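For the email use case in particular, a single run of text is often less useful than text that keeps its paragraph breaks. The sketch below is one possible approach, assuming that starting a new line at common block-level tags is good enough. The class name `BlockAwareExtractor`, the `BLOCK_TAGS` set, and the sample markup are hypothetical and would need tuning for real messages.

```python
from html.parser import HTMLParser


class BlockAwareExtractor(HTMLParser):
    """Hypothetical variant: starts a new line at block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}

    def __init__(self):
        super().__init__()
        self.lines = [[]]  # each inner list holds the fragments of one line

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.lines.append([])

    def handle_endtag(self, tag):
        # <br> has no content, so only its start tag breaks the line
        if tag in self.BLOCK_TAGS and tag != "br":
            self.lines.append([])

    def handle_data(self, data):
        # Collapse runs of whitespace inside a fragment to single spaces
        cleaned = " ".join(data.split())
        if cleaned:
            self.lines[-1].append(cleaned)

    def get_text(self):
        joined = (" ".join(parts) for parts in self.lines)
        return "\n".join(line for line in joined if line)


email_html = (
    "<div><h1>Order shipped</h1>"
    "<p>Your order <b>#123</b> is on its way.</p>"
    "<p>Thanks!<br>The Team</p></div>"
)

extractor = BlockAwareExtractor()
extractor.feed(email_html)
extractor.close()
email_text = extractor.get_text()
print(email_text)
```

This prints four lines: the heading, the order sentence, "Thanks!", and "The Team".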

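## Advantages of This Approach

- **Lightweight**: No external libraries are required; it is built on Python's native `HTMLParser`.
- **Easy to use**: The logic is encapsulated in a simple, reusable class.
- **Customizable**: The class is easy to extend to capture specific information such as attributes or additional tag data.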
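As a small illustration of the customization point above, this sketch records the `alt` text of images alongside the regular text. It is an assumption about what "additional tag data" might mean in practice; the class name `AltTextExtractor` and the `[image: ...]` output format are made up for the example.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Hypothetical variant: also records the alt text of <img> tags."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; value may be None
            alt = dict(attrs).get("alt")
            if alt:
                self.text.append(f"[image: {alt}]")

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

    def get_text(self):
        return " ".join(self.text)


extractor = AltTextExtractor()
extractor.feed('<p>Our logo: <img src="logo.png" alt="Company logo"></p>')
extractor.close()
print(extractor.get_text())  # prints: Our logo: [image: Company logo]
```

The same pattern (inspect `tag`, look up an attribute, append a fragment) applies to other attributes such as `title` or `src`.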

## Limitations and Alternatives

While `HTMLParser` is simple and efficient, it has some limitations:

- **Complex HTML**: It may struggle with very complex or poorly formatted HTML documents.
- **Limited Features**: It doesn't provide advanced parsing features like CSS selectors or DOM tree manipulation.
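To show what the lack of selectors means in practice, here is a rough sketch that collects the text of each `<li>` element. With a DOM-based library this would be a one-line query; with `HTMLParser` you have to track parser state yourself. The class name `ListItemExtractor` is hypothetical, and the sketch assumes flat, well-formed lists (nested `<li>` elements are not handled).

```python
from html.parser import HTMLParser


class ListItemExtractor(HTMLParser):
    """Hypothetical example: collect the text of each <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # fragments of the <li> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.items.append(" ".join(self._current))
            self._current = None

    def handle_data(self, data):
        stripped = data.strip()
        # Text outside any <li> is ignored
        if stripped and self._current is not None:
            self._current.append(stripped)


parser = ListItemExtractor()
parser.feed("<h2>Plans</h2><ul><li>Basic</li><li><b>Pro</b> plan</li></ul>")
parser.close()
print(parser.items)  # prints: ['Basic', 'Pro plan']
```

Once this kind of manual bookkeeping starts to grow, one of the libraries listed under Alternatives is usually the better choice.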

### Alternatives

If you need more robust features, consider using these libraries:

- **BeautifulSoup**: Excellent for complex HTML parsing and manipulation.
- **lxml**: Known for its speed and support for both XML and HTML parsing.

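## Conclusion

With this solution, you can extract plain text from HTML in just a few lines of code. Whether you are working on a personal project or a professional task, this approach is well suited to lightweight HTML cleanup and analysis.

If your use case involves more complex or malformed HTML, consider using a library such as BeautifulSoup or lxml for more robust parsing.

Feel free to try this code in your projects and share your experience. Happy coding!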
