如何在 Python 中从 HTML 文件中提取干净的文本，同时避免正则表达式的陷阱？-Python教程-PHP中文网

首页

后端开发

Python教程

如何在 Python 中从 HTML 文件中提取干净的文本，同时避免正则表达式的陷阱？

Barbara Streisand

Nov 28, 2024 pm 07:53 PM

How Can I Extract Clean Text from HTML Files in Python While Avoiding the Pitfalls of Regular Expressions?

使用 Python 从 HTML 文件中提取干净的文本

当寻求使用 Python 从 HTML 文件中提取文本时，重要的是要考虑鲁棒性和准确性。虽然正则表达式通常可以完成这项工作，但它们可能会遇到格式不良的 HTML。

对于更强大的解决方案，通常建议使用 Beautiful Soup 等库。然而，用户可能会遇到不需要的文本的挑战，例如 JavaScript 源和不正确的 HTML 实体解释。

要解决这些问题，需要更全面的方法。

html2text：一个有前途的解决方案

一个有前途的解决方案是 html2text。该库正确处理 HTML 实体并忽略 JavaScript。然而，它生成 Markdown 而不是纯文本，需要额外的处理来转换它。

利用 BeautifulSoup 和自定义代码

另一种方法是将 BeautifulSoup 与自定义代码。通过删除不需要的元素（例如脚本和样式）并利用 get_text() 方法，您可以获得干净的文本表示形式，而无需仅依赖正则表达式。

以下是演示此方法的 Python 代码片段：

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# Remove script and style elements
for script in soup(["script", "style"]):
    script.extract()

# Extract text
text = soup.get_text()

# Additional processing to remove unwanted whitespace and split headlines into separate lines
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

这种方法允许您从 HTML 文件中提取干净的、人类可读的文本，而没有正则表达式或库可能无法处理所有场景的缺点有效。

以上是如何在 Python 中从 HTML 文件中提取干净的文本，同时避免正则表达式的陷阱？的详细内容。更多信息请关注PHP中文网其他相关文章！

声明

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系admin@php.cn

列表和阵列之间的选择如何影响涉及大型数据集的Python应用程序的整体性能？May 03, 2025 am 12:11 AM

ForhandlinglargedatasetsinPython,useNumPyarraysforbetterperformance.1)NumPyarraysarememory-efficientandfasterfornumericaloperations.2)Avoidunnecessarytypeconversions.3)Leveragevectorizationforreducedtimecomplexity.4)Managememoryusagewithefficientdata

说明如何将内存分配给Python中的列表与数组。May 03, 2025 am 12:10 AM

Inpython，ListSusedynamicMemoryAllocationWithOver-Asalose，而alenumpyArraySallaySallocateFixedMemory.1）listssallocatemoremoremoremorythanneededinentientary上，respizeTized.2）numpyarsallaysallaysallocateAllocateAllocateAlcocateExactMemoryForements，OfferingPrediCtableSageButlessemageButlesseflextlessibility。

您如何在Python数组中指定元素的数据类型？May 03, 2025 am 12:06 AM

Inpython，YouCansspecthedatatAtatatPeyFelemereModeRernSpant.1）Usenpynernrump.1）Usenpynyp.dloatp.dloatp.ploatm64，formor professisconsiscontrolatatypes。

什么是Numpy，为什么对于Python中的数值计算很重要？May 03, 2025 am 12:03 AM

NumPyisessentialfornumericalcomputinginPythonduetoitsspeed,memoryefficiency,andcomprehensivemathematicalfunctions.1)It'sfastbecauseitperformsoperationsinC.2)NumPyarraysaremorememory-efficientthanPythonlists.3)Itoffersawiderangeofmathematicaloperation

讨论'连续内存分配”的概念及其对数组的重要性。May 03, 2025 am 12:01 AM

Contiguousmemoryallocationiscrucialforarraysbecauseitallowsforefficientandfastelementaccess.1)Itenablesconstanttimeaccess,O(1),duetodirectaddresscalculation.2)Itimprovescacheefficiencybyallowingmultipleelementfetchespercacheline.3)Itsimplifiesmemorym

您如何切成python列表？May 02, 2025 am 12:14 AM

SlicingaPythonlistisdoneusingthesyntaxlist[start:stop:step].Here'showitworks:1)Startistheindexofthefirstelementtoinclude.2)Stopistheindexofthefirstelementtoexclude.3)Stepistheincrementbetweenelements.It'susefulforextractingportionsoflistsandcanuseneg

在Numpy阵列上可以执行哪些常见操作？May 02, 2025 am 12:09 AM

numpyallowsforvariousoperationsonArrays：1）basicarithmeticlikeaddition，减法，乘法和division; 2）evationAperationssuchasmatrixmultiplication; 3）element-wiseOperations wiseOperationswithOutexpliitloops; 4）

Python的数据分析中如何使用阵列？May 02, 2025 am 12:09 AM

Arresinpython，尤其是Throughnumpyandpandas，weessentialFordataAnalysis，offeringSpeedAndeffied.1）NumpyArseNable efflaysenable efficefliceHandlingAtaSetSetSetSetSetSetSetSetSetSetSetsetSetSetSetSetsopplexoperationslikemovingaverages.2）

See all articles