How Can I Efficiently Extract Clean Text from HTML Files Using Python?

Patricia Arquette

Nov 29, 2024, 03:54 AM

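Extracting Text from HTML Files with Python: A Comprehensive Guide

Introduction

Extracting text from HTML files is essential for many data processing and analysis tasks. Regular expressions can work for simple HTML structures, but they tend to break on malformed markup. This article looks at a more robust alternative, Beautiful Soup, and gives a practical solution that strips unwanted JavaScript and decodes HTML entities.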
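
To make the limitation concrete, here is a minimal sketch of the regular-expression approach. The sample page and the helper name strip_tags_naively are made up for this illustration and are not part of the original solution.

```python
import re

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><p>Fish &amp; chips</p></body></html>"
)


def strip_tags_naively(html):
    """Delete anything that looks like a tag (the fragile approach)."""
    return re.sub(r"<[^>]+>", "", html)


naive = strip_tags_naively(SAMPLE_HTML)
print(naive)  # var x = 1;Fish &amp; chips
```

The tags are gone, but the JavaScript source is still in the output and the &amp; entity has not been decoded. These are the two problems the Beautiful Soup solution addresses.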

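Using Beautiful Soup

To extract text with Beautiful Soup, follow these steps:

Import the BeautifulSoup class from the bs4 package.
Fetch the HTML document with urlopen().
Create a BeautifulSoup object with BeautifulSoup(html, features="html.parser").
Remove unwanted elements (for example, scripts and styles) with for script in soup(["script", "style"]): script.extract().
Extract the text with soup.get_text().
Split the text into lines and strip leading and trailing whitespace with lines = (line.strip() for line in text.splitlines()).
Break multi-headlines onto separate lines with chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Note that the separator is two spaces.
Drop blank lines with text = '\n'.join(chunk for chunk in chunks if chunk).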
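
The last three steps are plain string handling and can be tried without any HTML at all. The sketch below applies them to a made-up string; the function name clean_text and the sample input are assumptions for illustration only.

```python
def clean_text(text):
    """Apply the whitespace cleanup steps to an already extracted string."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line on runs of two spaces so that several headlines
    # sharing one line end up as separate chunks.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only non-empty chunks, one per output line.
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical input that mimics the ragged output of get_text().
raw = "  Health news  \n\n\n   Top story  Second story   \n"
print(clean_text(raw))
```

Because the split uses two spaces rather than one, ordinary single spaces inside a phrase such as "Top story" are left alone.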

Code Example

Here is the complete code example:

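from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# Remove all script and style elements from the tree
for script in soup(["script", "style"]):
    script.extract()

# Get the text content (HTML entities are decoded by the parser)
text = soup.get_text()

# Strip leading and trailing whitespace on each line
lines = (line.strip() for line in text.splitlines())
# Break multi-headlines (separated by two spaces) into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# Drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)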
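
The example above needs network access. As a rough sketch, the same steps can be wrapped in a function that takes an HTML string, which makes them easy to try offline. The function name html_to_text and the sample page are hypothetical; the Beautiful Soup calls are the same ones used above.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    soup = BeautifulSoup(html, features="html.parser")
    # Remove script and style elements so their contents are not extracted.
    for element in soup(["script", "style"]):
        element.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample page</title>
    <style>p { color: red; }</style>
    <script>var tracker = "ignored";</script>
  </head>
  <body>
    <h1>Fish &amp; chips</h1>
    <p>Served daily.</p>
  </body>
</html>
"""

result = html_to_text(SAMPLE_HTML)
print(result)
```

For this input the script and style contents are dropped and the &amp; entity comes out as a plain ampersand.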

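Other Options

html2text: An alternative library that handles HTML entities and ignores JavaScript. However, it produces Markdown rather than plain text.
lxml: A powerful XML and HTML parsing library that can also extract text after stripping tags.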
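
For comparison, here is a minimal sketch of the lxml route, assuming the lxml package is installed. The function name lxml_to_text and the sample page are made up for illustration. Note that text_content() on its own would include script contents, so the sketch removes those elements first.

```python
import lxml.html

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = (
    "<html><head><script>var tracker = 'ignored';</script></head>"
    "<body><h1>Fish &amp; chips</h1><p>Served daily.</p></body></html>"
)


def lxml_to_text(html):
    """Return the text of an HTML string with script and style removed."""
    doc = lxml.html.fromstring(html)
    # Drop script and style elements together with their contents.
    for element in doc.xpath("//script | //style"):
        element.drop_tree()
    # Concatenate all remaining text nodes; entities are already decoded.
    return doc.text_content()


lxml_text = lxml_to_text(SAMPLE_HTML)
print(lxml_text)
```

text_content() joins text nodes without inserting separators, so adjacent elements can run together. The whitespace cleanup from the Beautiful Soup example can still be applied to the result where the source HTML contains line breaks.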

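Conclusion

This guide presented a complete solution for extracting text from HTML files with Beautiful Soup. By removing unwanted elements and decoding HTML entities, it produces plain-text output suitable for further processing and analysis.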
