How to Preserve Line Breaks When Converting HTML to Text Using Jsoup?

DDD
2024-10-31 20:37:29

Preserving Line Breaks in HTML-to-Text Conversion Using Jsoup

When converting HTML to plain text with jsoup, preserving line breaks is often essential for keeping the output readable and structured. By default, jsoup's text() method normalizes whitespace and does not emit newlines for <br> or <p> tags, so the line structure of the HTML is lost.

Solution:

To preserve line breaks, use a small helper method, here called br2nl(). It is not part of the jsoup API; it is defined in the listing under Usage and works as follows:

  1. Preserve Existing Newlines: If the original HTML contains newline characters (\n), they are preserved in the output, because pretty-printing is disabled.
  2. Convert <br> and <p> Tags: A literal \n marker (a backslash followed by the letter n) is appended to each <br> tag, and a literal \n\n marker is prepended to the contents of each <p> tag to signify a new paragraph.

  3. Post-Processing: The modified HTML is serialized, and the escaped newline markers (\\n in Java source) are replaced with actual newline characters (\n). Finally, Jsoup.clean() with an empty whitelist strips all remaining tags. Because clean() returns HTML, special characters such as & remain entity-escaped in the result.
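Step 3 depends on a detail of Java escaping that is easy to get wrong: the marker is two characters (a backslash and an n), and matching it with a regular expression requires four backslashes in the source. The following is a minimal sketch of just that replacement, independent of jsoup; the class and method names are hypothetical and chosen only for illustration.

```java
class EscapedNewlineDemo {

    // Replaces every two-character marker (backslash followed by 'n')
    // with a single real newline character.
    static String unescapeNewlines(String s) {
        // In Java source, "\\\\n" is the regex \\n, which matches a literal
        // backslash followed by the letter n. The replacement "\n" is a real newline.
        return s.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        String marked = "hello\\nworld";                 // contains backslash + n, no newline yet
        System.out.println(marked.contains("\n"));       // false
        String converted = unescapeNewlines(marked);
        System.out.println(converted.contains("\n"));    // true
        System.out.println(converted);
    }
}
```

By contrast, replaceAll("\\n", "\n") uses the regex \n, which matches an actual newline and replaces it with itself, so it leaves the markers untouched.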

Usage:

<code class="java">import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // renamed to Safelist in newer jsoup releases

public class LineBreakPreserver {

    public static String br2nl(String html) {
        if (html == null) {
            return html;
        }

        Document document = Jsoup.parse(html);
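        // Disable pretty-printing so html() keeps existing newlines and spacing.
        document.outputSettings(new Document.OutputSettings().prettyPrint(false));
        // Insert literal "\n" markers (backslash + n) that survive serialization.
        document.select("br").append("\\n");
        document.select("p").prepend("\\n\\n");
        // The regex \\n matches the two-character marker; replace it with a real newline.
        String s = document.html().replaceAll("\\\\n", "\n");
        // Strip all tags; keep pretty-printing off so the newlines are retained.
        return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));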
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
                "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

        String result = br2nl(html);
        System.out.println(result);
    }
}</code>

Output:

hello world
yo googlez
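As an alternative to inserting markers and cleaning the HTML a second time, the document tree can be walked directly and newlines emitted as <br> and <p> elements are encountered. The following is only a sketch under assumptions, not the approach described above: it requires jsoup on the classpath, the class and method names are hypothetical, and it handles only <br> and <p>.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class VisitorLineBreaks {

    // Walks the parsed body and builds plain text, adding one newline
    // for each <br> and two newlines before the contents of each <p>.
    static String toText(String html) {
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() returns the node's text with whitespace normalized
                    out.append(((TextNode) node).text());
                } else if (node instanceof Element) {
                    String name = ((Element) node).tagName();
                    if (name.equals("br")) {
                        out.append('\n');
                    } else if (name.equals("p")) {
                        out.append("\n\n");
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // nothing to do when leaving a node
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toText("<p><b>hello world</b></p><p><br><b>yo</b> googlez</p>"));
    }
}
```

Because text is taken from TextNode objects rather than from re-serialized HTML, characters such as & are not entity-escaped in the result. Callers may want to trim() the returned string to drop the leading blank lines produced by the first <p>.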
