How Can I Preserve Line Breaks When Converting HTML to Plain Text Using Jsoup?

Barbara Streisand
2024-10-30 23:24:30

Preserving Line Breaks Using Jsoup: A Comprehensive Guide

When converting HTML to plain text, preserving line breaks is crucial for readability. Jsoup, a popular Java HTML parser library, makes it easy to extract text from HTML, but its default text extraction normalizes whitespace, so some extra work is needed to keep the line structure.

In this guide, we look at preserving line breaks when using Jsoup's Jsoup.parse(str).text() method. This method extracts the text content from HTML, but it normalizes whitespace: newlines in the source and the line breaks implied by <br> and <p> tags are collapsed into single spaces.
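To make the collapsing behavior concrete, here is a minimal sketch. The class name TextCollapseDemo, the helper flatten, and the sample HTML are illustrative assumptions, not part of the original question.

```java
import org.jsoup.Jsoup;

class TextCollapseDemo {
    // Returns the normalized text of the whole document, as text() produces it.
    static String flatten(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        String html = "<p>first line<br>second line</p><p>third line</p>";
        // Every break becomes a single space:
        // prints "first line second line third line"
        System.out.println(flatten(html));
    }
}
```

The <br> and the paragraph boundary both end up as plain spaces, which is the behavior the rest of this guide works around.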

Utilizing TextNode.getWholeText()

The original question first considered Jsoup's TextNode.getWholeText() method. It returns the un-normalized text of a single text node, so literal newlines inside that node survive. However, it knows nothing about the surrounding markup, so <br> and <p> tags are not turned into line breaks.
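The limitation can be illustrated with a short sketch. The class name WholeTextDemo, the helper joinWholeText, and the sample HTML are hypothetical; the sketch simply concatenates the raw text of the text nodes directly inside the first <p> element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class WholeTextDemo {
    // Joins the raw (un-normalized) text of the text nodes directly under the first <p>.
    static String joinWholeText(String html) {
        Element p = Jsoup.parse(html).select("p").first();
        StringBuilder sb = new StringBuilder();
        for (TextNode node : p.textNodes()) {
            sb.append(node.getWholeText());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal newline inside the first text node is kept,
        // but nothing is emitted for the <br> between the two text nodes.
        String raw = joinWholeText("<p>line one\nline two<br>line three</p>");
        System.out.println(raw.equals("line one\nline twoline three")); // prints "true"
    }
}
```

The newline typed into the source survives, while "line two" and "line three" run together because the <br> element contributes no text of its own.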

The Effective Solution

The solution is to mark the line breaks in the parsed document before extracting the text, and then turn those markers into real newlines afterwards.

The code below takes the following steps:

  1. Parses the HTML string using Jsoup.
  2. Disables HTML pretty printing so that html() preserves line breaks and spacing.
  3. Appends a literal \n placeholder to each <br> tag and prepends \n\n to each <p> tag.
  4. Replaces each literal \n sequence in the serialized HTML with an actual newline.
  5. Cleans the modified HTML with an empty whitelist to remove all remaining tags.

Implementation

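<code class="java">import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // replaced by Safelist in newer jsoup releases

public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // makes html() preserve line breaks and spacing
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // mark line breaks with a literal backslash-n placeholder
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // turn the placeholders into real newlines
    String s = document.html().replaceAll("\\\\n", "\n");
    // strip all remaining tags
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}</code>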
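On newer jsoup releases the Whitelist class has been superseded by Safelist. The following is a sketch of the same method with that one substitution, assuming jsoup 1.14.2 or later is on the classpath. The wrapper class Br2NlDemo and the sample input are hypothetical and only serve as a usage example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist; // assumption: jsoup 1.14.2+; older versions use Whitelist

class Br2NlDemo {
    static String br2nl(String html) {
        if (html == null)
            return html;
        Document document = Jsoup.parse(html);
        // keep html() from re-indenting or adding its own line breaks
        document.outputSettings(new Document.OutputSettings().prettyPrint(false));
        // mark breaks with a literal backslash-n placeholder
        document.select("br").append("\\n");
        document.select("p").prepend("\\n\\n");
        // regex \\n matches the two-character placeholder; replace it with a real newline
        String s = document.html().replaceAll("\\\\n", "\n");
        // Safelist.none() allows no tags, so only text remains
        return Jsoup.clean(s, "", Safelist.none(),
                new Document.OutputSettings().prettyPrint(false));
    }

    public static void main(String[] args) {
        String text = br2nl("hello<br>world<p>next paragraph</p>");
        // "world" should start a new line, with a blank line before "next paragraph"
        System.out.println(text);
    }
}
```

Note that Jsoup.clean returns HTML, so characters such as & and < in the text come back as entities (for example &amp;amp;); unescape them separately if truly plain text is needed.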

Satisfied Requirements

The provided solution fulfills the following requirements:

  • Preserves existing newlines (\n) in the HTML.
  • Converts <br> and <p> tags into newlines.
  • Removes any remaining tags from the resulting text.

By implementing this solution, you can effectively preserve line breaks when converting HTML to plain text using Jsoup, ensuring accurate and readable results.
