
How to Extract Visible Text from Webpages with BeautifulSoup?

Patricia Arquette, original article, 2024-11-17 07:43:03


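Keeping Only the Visible Text of a Webpage with BeautifulSoup

Extracting the visible text of a webpage is harder than it looks, because scripts, style sheets, comments, and other non-rendered elements are mixed in with the content. One way to handle this is to collect every text node with BeautifulSoup's find_all() method (spelled findAll() in older code) and then filter out the nodes a browser would not display.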

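Identifying Visible Text

To target visible text reliably, apply the following criteria:

Ignore text whose parent element is style, script, head, title, meta, or [document].
Filter out instances of the Comment object.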
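To make the two criteria concrete, the sketch below lists every text node in a small document together with its parent tag and whether it is a Comment. The sample markup and the helper name describe_text_nodes are assumptions made for illustration; they are not part of the original solution.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page: one visible paragraph plus several hidden pieces.
SAMPLE_HTML = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><p>Hello</p><!-- hidden note --><script>var x = 1;</script></body></html>"
)

def describe_text_nodes(html):
    """Return (parent tag name, is_comment, stripped text) for each text node."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for node in soup.find_all(string=True):
        rows.append((node.parent.name, isinstance(node, Comment), str(node).strip()))
    return rows

for parent_name, is_comment, text in describe_text_nodes(SAMPLE_HTML):
    print(parent_name, is_comment, repr(text))
```

Only the node whose parent is p passes both criteria here. The title, style, and script text is rejected by its parent name, and the comment is rejected by its type.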

Implementing the Solution

Define the visibility filter:

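from bs4.element import Comment

def tag_visible(element):
    # Text inside these parents is never rendered as page content.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are text nodes too, but they are not displayed.
    if isinstance(element, Comment):
        return False
    return True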

Extract the visible text:

from bs4 import BeautifulSoup
import urllib.request

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True matches every text node; it replaces the deprecated findAll(text=True) spelling.
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)
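Joining the stripped fragments with single spaces can leave runs of spaces wherever a text node contained only whitespace. The variant below is a minimal sketch of one way to tidy that, assuming collapsed whitespace is acceptable for your use case. The names HIDDEN_PARENTS, is_visible, and compact_visible_text are hypothetical and repeat the filter logic only so the block runs on its own.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment

# Same parent names as the visibility filter above (hypothetical constant name).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}

def is_visible(node):
    """Apply the two visibility criteria to a single text node."""
    return node.parent.name not in HIDDEN_PARENTS and not isinstance(node, Comment)

def compact_visible_text(html):
    """Join the visible text nodes and collapse every run of whitespace to one space."""
    soup = BeautifulSoup(html, "html.parser")
    fragments = (str(node) for node in soup.find_all(string=True) if is_visible(node))
    return re.sub(r"\s+", " ", " ".join(fragments)).strip()

sample = (
    "<html><head><title>T</title></head><body><h1>Storm</h1>\n\n"
    "<p>Snow  fell\n overnight.</p><!-- ad slot --><script>track();</script></body></html>"
)
print(compact_visible_text(sample))  # prints: Storm Snow fell overnight.
```

Collapsing whitespace with a regular expression also normalizes line breaks inside paragraphs, which may or may not be desirable if you need to preserve the page's original layout.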

Example usage:

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

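Result:

This code fetches the specified page and prints its visible text, leaving out scripts, styles, comments, and other non-content elements.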
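The bare urlopen() call in the example waits indefinitely if the server stalls and hands raw bytes to the parser. The helper below is a hedged sketch of a slightly more defensive fetch. The function name fetch_html, the 10-second default timeout, and the User-Agent value are all assumptions made for illustration, not part of the original solution.

```python
import urllib.request

def fetch_html(url, timeout=10.0):
    """Download a page and return it as text (hypothetical helper)."""
    # Assumption: some servers reject urllib's default User-Agent,
    # so a browser-like value is sent instead.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can be passed straight to text_from_html(), since BeautifulSoup accepts either bytes or str. Network failures still raise urllib.error.URLError, which callers may want to catch.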
