How to convert PDF files to HTML files using Java

PHPz
2023-04-26 18:00:16

PDF is a widely used document format, but a PDF file sometimes needs to be converted to HTML so that its content can be presented directly on a web page. In Java, the text of a PDF file can be extracted and turned into an HTML file with a small amount of code and an open source library.

This article walks through converting PDF files into HTML files with Java and covers the following topics:

  1. Basic concepts and differences between PDF files and HTML files
  2. Java basics of converting PDF files into HTML files
  3. The PDFBox library and its usage
  4. Generation of HTML files
  5. Java implementation of the entire process

  1. Basic concepts and differences between PDF files and HTML files

PDF (Portable Document Format) is a file format for viewing, printing, and sharing documents across platforms. The layout and formatting of a PDF file are the same on every platform, so PDF is commonly used for publishing, printing, and electronic forms.

HTML (Hypertext Markup Language) is the standard markup language for building web pages. An HTML file is composed of text, images, links, and other elements. Browsers parse HTML files and render them as web pages.

The main difference between PDF files and HTML files is layout. The layout of a PDF file is fixed, whereas the layout of an HTML file adjusts dynamically to the size of the browser window and the user's preferences.

  2. Java basics of converting PDF files into HTML files

Java is a widely used programming language with a powerful API and a large open source community, and it can be used to build many kinds of applications. To convert PDF files to HTML files, you need a PDF library for Java.

A PDF library parses a PDF file into an object model that a program can inspect and modify. With that model, a PDF file can be resized, enhanced, or converted. Several PDF libraries are available for Java; this article uses Apache PDFBox.

  3. The PDFBox library and its usage

PDFBox is an open source Java library from the Apache Software Foundation that can be used to process PDF files. It offers many features including parsing, creating and editing PDF files.

In this example, we will use PDFBox version 2.x. PDFBox 2.0.x runs on Java 6 or higher, so any current JDK is sufficient.

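To use the PDFBox library, add the following dependencies to the Maven pom.xml. The Freemarker dependency is needed for the template engine used later in this article:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.21</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.21</version>
</dependency>
<dependency>
    <groupId>org.freemarker</groupId>
    <artifactId>freemarker</artifactId>
    <version>2.3.28</version>
</dependency>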

Once the dependencies have been downloaded, we can use the PDFBox library to process PDF files. The next step is to process each page of the PDF file and convert it into text.
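As a rough illustration of the per-page step, the sketch below wraps page texts in one HTML section per page. It assumes the texts have already been extracted, for example by calling setStartPage and setEndPage with the same page number before each extraction. The class name PageSections, its methods, and the "page" CSS class are hypothetical names chosen for this example; they are not part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: wraps already-extracted page texts in per-page HTML sections. */
final class PageSections {

    /** Escapes the characters that are special in HTML text content. */
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Builds one <section> per page; page numbers start at 1. */
    static String toSections(List<String> pageTexts) {
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < pageTexts.size(); i++) {
            html.append("<section class=\"page\" id=\"page-").append(i + 1).append("\">\n")
                .append("<pre>").append(escapeHtml(pageTexts.get(i))).append("</pre>\n")
                .append("</section>\n");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Stand-in for text extracted page by page from a PDF.
        List<String> pages = Arrays.asList("Total < 10 & rising", "Second page");
        System.out.print(toSections(pages));
    }
}
```

Escaping matters because extracted text can contain characters such as < and &, which a browser would otherwise interpret as markup.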

  4. Generation of HTML files

HTML is the standard markup language used to build web pages. An HTML file consists of markup, optionally combined with embedded or external CSS and JavaScript. In this example, we will use Java code to generate a complete HTML file.

We use the Freemarker template engine to insert dynamic content into the HTML code. Freemarker is a popular template engine that merges a template with data to produce the final HTML file. The HTML template is as follows:

<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<title>${title}</title>
<style>
${css}
</style>
</head>
<body>
<div class="content">
${content}
</div>
<script>
${javascript}
</script>
</body>
</html>

Using this template, we can put the text content of the PDF pages into the ${content} variable, and put the stylesheet and script code into the ${css} and ${javascript} variables.
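To make the idea of filling template variables concrete, here is a toy sketch of ${name} substitution written with the Java standard library only. It is not how Freemarker is implemented and supports none of its features; the class PlaceholderDemo and its render method are hypothetical names used for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration of ${name} substitution; not how Freemarker is implemented. */
final class PlaceholderDemo {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Replaces each ${name} with its value from the model; unknown names become empty. */
    static String render(String template, Map<String, String> model) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = model.getOrDefault(matcher.group(1), "");
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> model = new HashMap<>();
        model.put("title", "report");
        model.put("content", "Price: $5");
        // Prints: <title>report</title><div>Price: $5</div>
        System.out.println(render("<title>${title}</title><div>${content}</div>", model));
    }
}
```

One difference worth noting: this toy silently drops unknown names, whereas Freemarker by default reports an error when a template refers to a variable that is missing from the data model.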

  5. Java implementation of the entire process

Now that we have introduced all the necessary steps, we can start writing the Java code to convert PDF files.

import java.io.File;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import freemarker.template.Configuration;
import freemarker.template.Template;
import freemarker.template.TemplateException;

public class PDFToHTMLConverter {

    private static final String TEMPLATE_DIRECTORY = "src/main/resources";
    private static final String TEMPLATE_FILE = "template.html";
    private static final String OUTPUT_DIRECTORY = "./out/";

    public static void main(String[] args) throws IOException, TemplateException {
        File file = new File(args[0]);

        // Extract the text of every page; try-with-resources closes the document.
        String text;
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(document.getNumberOfPages());
            text = pdfStripper.getText(document);
        }

        // Template names are resolved relative to TEMPLATE_DIRECTORY.
        Configuration freemarkerCfg = new Configuration(Configuration.VERSION_2_3_28);
        freemarkerCfg.setDirectoryForTemplateLoading(new File(TEMPLATE_DIRECTORY));
        freemarkerCfg.setDefaultEncoding("UTF-8");
        Template template = freemarkerCfg.getTemplate(TEMPLATE_FILE);

        // Strip only a trailing ".pdf" to get the title and output file name.
        String title = file.getName().replaceFirst("(?i)\\.pdf$", "");

        // The template inserts values verbatim, so escape HTML special characters first.
        Map<String, Object> model = new HashMap<>();
        model.put("title", escapeHtml(title));
        model.put("content", escapeHtml(text));
        // The extracted text carries no styling, so the stylesheet is written by hand.
        model.put("css", ".content { white-space: pre-wrap; font-family: sans-serif; }");
        model.put("javascript", "");

        File outputDirectory = new File(OUTPUT_DIRECTORY);
        outputDirectory.mkdirs();
        File htmlFile = new File(outputDirectory, title + ".html");

        // Render the template straight into the UTF-8 output file.
        try (Writer out = Files.newBufferedWriter(htmlFile.toPath(), StandardCharsets.UTF_8)) {
            template.process(model, out);
        }
    }

    private static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

In this code, we first load the PDF file using the PDDocument class of the PDFBox library and extract its text content using the PDFTextStripper class. Because the template inserts values verbatim, the title and the extracted text are HTML-escaped before they are passed to it.

Next, we use the Freemarker template engine to generate the HTML file from the HTML template. PDFTextStripper returns plain text without font or layout information, so the stylesheet passed to ${css} is a short hand-written rule that preserves the extracted line breaks. Finally, the rendered template is written to an HTML file in the output directory.

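Usage example (assuming a Maven project that declares the dependencies above, with the class in the default package):

mvn compile exec:java -Dexec.mainClass=PDFToHTMLConverter -Dexec.args="input.pdf"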

In this example, we take a PDF file as input and generate an HTML file containing text and CSS.
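The sample program names its output after the input file and places it in an output directory. A minimal sketch of that mapping, using only the standard library, might look like the following. The class OutputPaths and the method htmlPathFor are hypothetical names for this illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that maps an input PDF path to the HTML output path. */
final class OutputPaths {

    /** Returns outputDir/<input file name without a trailing .pdf>.html. */
    static Path htmlPathFor(Path inputPdf, Path outputDir) {
        String name = inputPdf.getFileName().toString();
        // Strip only a trailing ".pdf" (any case), not occurrences inside the name.
        String base = name.replaceFirst("(?i)\\.pdf$", "");
        return outputDir.resolve(base + ".html");
    }

    public static void main(String[] args) {
        System.out.println(htmlPathFor(Paths.get("docs", "report.pdf"), Paths.get("out")));
    }
}
```

Removing only the trailing extension avoids mangling names such as "my.pdf.notes.pdf", where a plain text replacement would delete every ".pdf" in the name.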

Done! We have successfully converted PDF files to HTML files.

This article described how to convert PDF files into HTML files using the Java programming language. We looked at the differences between PDF files and HTML files, introduced the PDFBox library, and provided sample code for generating HTML files. With these pieces, readers should be able to apply the conversion in their own projects.
