How to Extract Visible Webpage Text Using BeautifulSoup?

DDD, 2024-11-25 18:41:09

Extracting Visible Webpage Text with BeautifulSoup

Many web-scraping tasks involve retrieving the visible text content of a webpage, excluding elements like scripts, comments, and CSS styles. Using BeautifulSoup, achieving this can be straightforward with the right approach.

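A common issue arises when calling findAll(text=True) (spelled find_all(string=True) in current Beautiful Soup releases): it returns every text node in the document, including those inside elements that browsers never render, such as script and style blocks, as well as HTML comments. To address this, we can define a custom filter that excludes text nodes by their parent tag and drops comments.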
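
To make the problem concrete, here is a minimal sketch that lists every text node in a small page. The SAMPLE_HTML string and the all_strings name are invented for this illustration and are not part of the original article; the sketch assumes a Beautiful Soup release that accepts the string= argument.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """<html><head><title>Demo</title>
<style>p { color: red; }</style></head>
<body><p>Hello, reader.</p>
<script>var tracker = 1;</script>
<!-- internal note -->
</body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# string=True matches every text node, wherever it sits in the tree.
# Whitespace-only nodes are dropped here only to keep the listing short.
all_strings = [s.strip() for s in soup.find_all(string=True) if s.strip()]

for s in all_strings:
    print(repr(s))
```

The listing includes the title, the CSS rule, the JavaScript statement, and the comment alongside the one paragraph a reader would actually see, which is why an extra filtering step is needed.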

The following code exemplifies this approach:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


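def tag_visible(element):
    # Skip text whose parent tag is never rendered in the page body.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments, which are also returned as text nodes.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # find_all(string=True) is the current spelling of findAll(text=True).
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)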

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

The tag_visible function returns False when a text node's parent is one of the excluded tags or when the node itself is a comment. text_from_html then strips each node that passes the filter and joins the results into a single string with u" ".join(t.strip() for t in visible_texts). Because whitespace-only nodes strip down to empty strings, the output can contain runs of consecutive spaces.
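
The filter can be tried without a network request by feeding it an inline string. The sketch below is a variation on the code above, not the article's exact version: SAMPLE_HTML and HIDDEN_PARENTS are names invented for this illustration, it uses find_all(string=True), and it also drops empty strings before joining, which the article's version does not do.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text is not shown to the reader (same list as the article).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def tag_visible(element):
    """Return True for text nodes that are neither hidden nor comments."""
    if element.parent.name in HIDDEN_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, "html.parser")
    texts = soup.find_all(string=True)
    stripped = (t.strip() for t in filter(tag_visible, texts))
    # Assumption for this sketch: skip empty strings to avoid doubled spaces.
    return " ".join(t for t in stripped if t)


# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """<html><head><title>Demo</title>
<style>p { color: red; }</style></head>
<body><h1>Storm report</h1>
<p>Snow fell <b>overnight</b> across the region.</p>
<script>var tracker = 1;</script>
<!-- internal note -->
</body></html>"""

result = text_from_html(SAMPLE_HTML)
print(result)  # Storm report Snow fell overnight across the region.
```

The title, stylesheet, script, and comment are all gone from the output, while text split across inline tags such as b is stitched back together with single spaces.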

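This approach removes text that sits inside the listed tags, such as scripts and styles, along with HTML comments. It filters by tag name only, so text hidden with CSS (for example, display: none) is still returned, and content that a page renders with JavaScript is not present in the downloaded HTML at all.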
