poi html to word

WBOY
2023-05-15 22:56:39

HTML pages often need to be converted into Word documents for editing, typesetting, printing, and similar tasks. This article shows how to use the Apache POI library, together with Jsoup for HTML parsing, to convert an HTML page into a Word document, and provides some practical code examples.

1. Introduction to POI

POI originally stood for "Poor Obfuscation Implementation". It is an open source project of the Apache Software Foundation that provides a set of Java APIs for reading and writing Microsoft Office documents (Word, Excel, PowerPoint, etc.). POI is widely used in Java development for creating, reading, and writing Office documents.

2. The basic process of creating a Word document with POI

Before using POI to create a Word document, we need to first understand the basic process of creating a Word document.

  1. Create an empty Word document

Create an empty Word document by using the XWPFDocument class provided by POI.

XWPFDocument doc = new XWPFDocument();

  2. Operate on the Word document content

Document content is manipulated through the XWPFParagraph and XWPFRun classes provided by POI, specifically:

(1) Create a paragraph

XWPFParagraph para = doc.createParagraph();

(2) Create text

XWPFRun run = para.createRun();
run.setText("Hello World!");

  3. Write the Word document to a file

Use the write method provided by the XWPFDocument class to write the Word document to a file.

// try-with-resources closes the stream even if write() throws
try (FileOutputStream out = new FileOutputStream("output.docx")) {
    doc.write(out);
}
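
The three steps above can be combined with some basic formatting. The following is a minimal sketch, not a definitive implementation: the class name StyledParagraphSketch and its helper methods are hypothetical, and the alignment, bold flag, and font size are arbitrary choices for illustration. It writes the document to memory instead of a file so that the result can be reopened and checked.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.poi.xwpf.usermodel.ParagraphAlignment;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Hypothetical helper class, for illustration only
final class StyledParagraphSketch {

    /** Builds a one-paragraph document in memory and returns the .docx bytes. */
    static byte[] buildDocx(String title) {
        try (XWPFDocument doc = new XWPFDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            XWPFParagraph para = doc.createParagraph();
            para.setAlignment(ParagraphAlignment.CENTER); // paragraph-level formatting

            XWPFRun run = para.createRun();
            run.setBold(true);   // character-level formatting is set on the run
            run.setFontSize(16); // font size in points (arbitrary value)
            run.setText(title);

            doc.write(out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reopens the .docx bytes and returns the text of the first paragraph. */
    static String firstParagraphText(byte[] docx) {
        try (XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(docx))) {
            return doc.getParagraphs().get(0).getText();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = buildDocx("Hello World!");
        System.out.println(firstParagraphText(bytes));
    }
}
```

Paragraph-level settings such as alignment belong to XWPFParagraph, while character-level settings such as bold and font size belong to XWPFRun, so a paragraph with mixed formatting needs one run per differently formatted span.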

3. Convert HTML to Word document

Above we have briefly introduced the basic process of using POI to create a Word document. Below we will introduce how to use POI to convert HTML pages into Word documents.

  1. Get the content of the HTML page

We can use the URLConnection class provided by Java to get the content of the HTML page, as shown below:

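String urlStr = "http://www.baidu.com";
URL url = new URL(urlStr);
URLConnection conn = url.openConnection();
StringBuilder sb = new StringBuilder();
// Decode explicitly as UTF-8 (assumed here) instead of the platform default charset
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
        // Keep the line break so that words on adjacent lines are not fused together
        sb.append(line).append('\n');
    }
}
String html = sb.toString();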
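
If the page is not encoded in the charset the reader uses, the text will come out garbled. One possible refinement is to take the charset from the Content-Type response header and fall back to UTF-8 when none is given. The sketch below uses only the Java standard library; the class ContentTypeCharset and the method charsetFrom are hypothetical names, and the UTF-8 fallback is an assumption rather than something the original code does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, for illustration only
final class ContentTypeCharset {

    /**
     * Extracts the charset from a Content-Type value such as
     * "text/html; charset=ISO-8859-1". Returns UTF-8 (an assumed default)
     * when the header is missing, has no charset, or names an unknown one.
     */
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: fall through to the default
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFrom("text/html"));
    }
}
```

With this helper, the reader could be created as new InputStreamReader(conn.getInputStream(), ContentTypeCharset.charsetFrom(conn.getContentType())). Pages that declare their encoding only in a meta tag are not covered by this sketch.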

  2. Parse the HTML page

Parse the retrieved HTML content with the Jsoup library, as shown below:

Document docHtml = Jsoup.parse(html);
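
To make the parsing step easier to try out without a network connection, here is a small sketch that parses an HTML string and collects the text of its paragraphs. The class JsoupParseSketch, the method paragraphTexts, and the sample HTML are hypothetical and only serve as an illustration; Jsoup.parse, select, and text are the Jsoup calls being demonstrated.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only
final class JsoupParseSketch {

    /** Returns the visible text of every <p> element, in document order. */
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // select("p") finds <p> elements at any depth of the document
        for (Element p : doc.select("p")) {
            texts.add(p.text()); // text() drops the tags and keeps the combined text
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>First <b>bold</b></p>"
                + "<div><p>Second</p></div></body></html>";
        System.out.println(paragraphTexts(html));
    }
}
```

Note that text() flattens inline markup: the bold word in the first paragraph is kept as text, but the fact that it was bold is lost.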

  3. Create the Word document content

(1) Create a blank Word document with POI's XWPFDocument class

XWPFDocument docx = new XWPFDocument();

(2) Get all paragraphs in the HTML page

Elements parags = docHtml.getElementsByTag("p");

(3) Convert each paragraph of the HTML page into a paragraph of the Word document

for (Element p : parags) {
    XWPFParagraph paragraph = docx.createParagraph(); // create a new paragraph
    XWPFRun run = paragraph.createRun(); // create a text run (XWPFRun) in the paragraph
    run.setText(p.text()); // set the text content of the run
}
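
The loop above copies only the plain text of each paragraph, so inline formatting is lost. One possible extension, sketched here under stated assumptions, is to walk the direct child nodes of each p element and create one run per node, marking runs that come from b/strong or i/em elements as bold or italic. The class InlineRunsSketch and its methods are hypothetical; nested inline tags, links, and other elements are deliberately not handled.

```java
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class, for illustration only
final class InlineRunsSketch {

    /** Adds one run per direct child node of the given HTML paragraph. */
    static void appendInline(XWPFParagraph para, Element p) {
        for (Node child : p.childNodes()) {
            if (child instanceof TextNode) {
                String text = ((TextNode) child).text();
                if (!text.isEmpty()) {
                    para.createRun().setText(text); // plain, unformatted run
                }
            } else if (child instanceof Element) {
                Element e = (Element) child;
                String tag = e.tagName();
                XWPFRun run = para.createRun();
                run.setBold(tag.equals("b") || tag.equals("strong"));
                run.setItalic(tag.equals("i") || tag.equals("em"));
                run.setText(e.text()); // nested markup inside e is flattened to text
            }
        }
    }

    /** Converts every <p> element of the HTML string; the caller closes the result. */
    static XWPFDocument convert(String html) {
        XWPFDocument docx = new XWPFDocument();
        for (Element p : Jsoup.parse(html).select("p")) {
            appendInline(docx.createParagraph(), p);
        }
        return docx;
    }

    public static void main(String[] args) throws Exception {
        try (XWPFDocument docx = convert("<p>Hello <b>bold</b> and <i>italic</i> text.</p>")) {
            for (XWPFRun run : docx.getParagraphs().get(0).getRuns()) {
                System.out.println(run.text() + " bold=" + run.isBold() + " italic=" + run.isItalic());
            }
        }
    }
}
```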

  4. Write the Word document to disk

Finally, write the created Word document to disk for later use.

// try-with-resources closes the stream even if write() throws
try (OutputStream os = new FileOutputStream("output.docx")) {
    docx.write(os);
}

4. Complete code example

The following is a complete code example for converting an HTML page into a Word document:

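import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Explicit imports: org.jsoup.nodes and org.apache.poi.xwpf.usermodel both
// contain a type named Document, so wildcard imports of both would be ambiguous.
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Html2Word {
    public static void main(String[] args) throws Exception {
        String urlStr = "http://www.baidu.com"; // URL of the HTML page to convert
        URL url = new URL(urlStr);
        URLConnection conn = url.openConnection();
        StringBuilder sb = new StringBuilder();
        // Read the page, decoding as UTF-8 (assumed) rather than the platform default
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n'); // keep line breaks so words are not fused
            }
        }
        String html = sb.toString();
        Document docHtml = Jsoup.parse(html);
        Elements parags = docHtml.getElementsByTag("p"); // all paragraphs in the HTML page
        // Create a blank Word document; both it and the output stream are closed automatically
        try (XWPFDocument docx = new XWPFDocument();
             OutputStream os = new FileOutputStream("output.docx")) {
            for (Element p : parags) {
                XWPFParagraph paragraph = docx.createParagraph(); // create a new paragraph
                XWPFRun run = paragraph.createRun(); // create a text run (XWPFRun) in the paragraph
                run.setText(p.text()); // set the text content of the run
            }
            docx.write(os);
        }
    }
}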
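
The complete example fetches a live page, which makes it hard to check the result reproducibly. A possible variation, sketched below, separates the conversion from the download: it takes an HTML string, returns the .docx bytes, and also keeps h1 to h3 headings as bold paragraphs in document order. The class HtmlToDocxSketch, its method names, and the font sizes are hypothetical choices for illustration, not part of POI or Jsoup.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only
final class HtmlToDocxSketch {

    /** Converts headings (h1-h3) and paragraphs of the HTML string to .docx bytes. */
    static byte[] convert(String html) {
        try (XWPFDocument docx = new XWPFDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            // A comma-separated selector returns matches in document order
            for (Element el : Jsoup.parse(html).select("h1, h2, h3, p")) {
                XWPFRun run = docx.createParagraph().createRun();
                if (!el.tagName().equals("p")) {
                    run.setBold(true);
                    run.setFontSize(headingSize(el.tagName()));
                }
                run.setText(el.text());
            }
            docx.write(out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Arbitrary point sizes chosen for this sketch. */
    static int headingSize(String tag) {
        switch (tag) {
            case "h1": return 20;
            case "h2": return 16;
            default:   return 14;
        }
    }

    /** Reopens .docx bytes and lists the paragraph texts, to verify a conversion. */
    static List<String> paragraphTexts(byte[] docxBytes) {
        try (XWPFDocument docx = new XWPFDocument(new ByteArrayInputStream(docxBytes))) {
            List<String> texts = new ArrayList<>();
            for (XWPFParagraph p : docx.getParagraphs()) {
                texts.add(p.getText());
            }
            return texts;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>Body text</p><h2>Section</h2><p>More</p>";
        System.out.println(paragraphTexts(convert(html)));
    }
}
```

The headings are formatted directly on the run rather than through a named Word style, because a document created with new XWPFDocument() does not contain heading style definitions by default.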

5. Summary

As the introduction above shows, POI together with Jsoup makes it straightforward to convert the text of an HTML page into a Word document. Note that the example only copies the plain text of the p elements; formatting, images, tables, and other elements are not carried over. POI wraps the Office file formats in Java APIs, which makes it more convenient to work with Word, Excel, and other document formats from Java code.
