
How do you extract visible text from a webpage with BeautifulSoup?

Patricia Arquette, original post, 2024-11-17 07:43:03, 847 views


Keeping only the visible text of a webpage with BeautifulSoup

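Extracting the visible text from a webpage can be tricky, because scripts, comments, and other non-rendered elements are mixed in with the content. BeautifulSoup's findAll() function (spelled find_all() in current releases) can collect every text node in the document, so that the unwanted ones can then be filtered out.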
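To see why filtering is needed, the following minimal sketch lists every text node that find_all(string=True) returns for a small page. The SAMPLE_HTML markup is a hypothetical page invented for illustration, not taken from the article.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head><title>Demo</title><script>var x = 1;</script></head>
<body><!-- hidden note --><p>Hello, reader.</p></body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# string=True matches every text node, whether or not a browser would render it.
all_strings = soup.find_all(string=True)

for s in all_strings:
    # Label each node by its parent tag, or as a comment.
    kind = "comment" if isinstance(s, Comment) else s.parent.name
    print(f"{kind}: {s.strip()!r}")
```

The listing includes the title, the script body, and the comment alongside the paragraph text, which is the clutter the visibility filter is meant to remove.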

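Identifying visible text

To locate visible text reliably, apply the following criteria:

Ignore text inside <style>, <script>, <head>, <title>, and <meta> elements, and text attached directly to the [document] root.
Filter out instances of the Comment class.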

Implementing the solution

Define the visibility filter:

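from bs4.element import Comment

def tag_visible(element):
    # Text inside these parents is never rendered by a browser.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are text nodes too, but they are not displayed.
    if isinstance(element, Comment):
        return False
    return True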

Extract the visible text:

from bs4 import BeautifulSoup
import urllib.request

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True matches every text node (older code spelled this findAll(text=True)).
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)
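Because whitespace-only text nodes pass the filter, the joined result can contain runs of spaces. The sketch below is one possible variant, not the article's method: it drops empty pieces and collapses whitespace, and it runs offline against an inline page. The names HIDDEN_PARENTS, is_visible, visible_text_compact, and SAMPLE_HTML are hypothetical and chosen for this illustration.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text is not rendered (same list as the filter above).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}

def is_visible(node):
    """Return True for text nodes a reader would see on the page."""
    return node.parent.name not in HIDDEN_PARENTS and not isinstance(node, Comment)

def visible_text_compact(markup):
    """Variant of text_from_html that drops empty strings and collapses whitespace."""
    soup = BeautifulSoup(markup, "html.parser")
    pieces = (s.strip() for s in soup.find_all(string=True) if is_visible(s))
    return re.sub(r"\s+", " ", " ".join(p for p in pieces if p))

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><!-- hidden note --><p>Hello,
   reader.</p><script>var x = 1;</script><p>Second paragraph.</p></body></html>"""

result = visible_text_compact(SAMPLE_HTML)
print(result)
```

Only the two paragraphs survive; the title, stylesheet, script, and comment are filtered out.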

Example usage:

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

Output:

This code extracts and prints the visible text of the specified page, leaving out scripts, comments, and other non-text elements.
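The usage example calls urlopen directly, so a slow or unreachable server would either hang or raise an exception. As a hedged sketch, the download step could be wrapped as follows. The helper name fetch_html, the 10-second timeout, and the User-Agent string are assumptions made for this illustration and do not come from the article.

```python
import urllib.request

def fetch_html(url, timeout=10.0):
    """Return the raw bytes at url, or None if the request fails.

    fetch_html, the default timeout, and the User-Agent value are
    illustrative assumptions, not part of the original article.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "visible-text-demo/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except OSError as exc:  # covers URLError, HTTPError, and socket timeouts
        print(f"Could not fetch {url}: {exc}")
        return None
```

With such a helper, the example could assign html = fetch_html(url) and call print(text_from_html(html)) only when html is not None.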
