How Do I Read Content from Multiple File Types Within a Zip Archive Using Apache Tika?

Mary-Kate Olsen
2024-10-28 21:20:30

Reading the Contents of Files Inside a Zip Archive with Apache Tika

Challenge:

You want to write a Java program that uses Apache Tika to extract and read the content of every file inside a zip archive. The archive contains a mix of plain-text (.txt), PDF, and DOCX files.

Solution:

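import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Requires tika-core plus the Tika parser modules (PDF, OOXML) on the classpath.
public class ZipContentExtractor {

    public static void main(String[] args) throws IOException, SAXException, TikaException {
        // Backslashes must be escaped inside a Java string literal
        File zipFile = new File("C:\\Users\\xxx\\Desktop\\abc.zip");

        try (ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(zipFile))) {
            ZipEntry entry;
            while ((entry = zipInputStream.getNextEntry()) != null) {
                // Compare extensions case-insensitively
                String name = entry.getName().toLowerCase(Locale.ROOT);
                Parser parser;
                String label;
                if (name.endsWith(".txt")) {          // Handling text files
                    parser = new AutoDetectParser();
                    label = "TXT";
                } else if (name.endsWith(".pdf")) {   // Handling PDF files
                    parser = new PDFParser();
                    label = "PDF";
                } else if (name.endsWith(".docx")) {  // Handling DOCX files
                    parser = new OOXMLParser();
                    label = "DOCX";
                } else {
                    continue;                         // Directories and other file types
                }

                // -1 disables BodyContentHandler's default 100,000-character write limit
                BodyContentHandler textHandler = new BodyContentHandler(-1);
                // Wrapper with a no-op close(), so a parser cannot close the shared zip stream
                InputStream entryStream = new FilterInputStream(zipInputStream) {
                    @Override
                    public void close() { }
                };
                parser.parse(entryStream, textHandler, new Metadata(), new ParseContext());
                System.out.println(label + " file content: " + textHandler.toString());
            }
        }
    }
}

Explanation:

  • The code iterates through the entries in the zip file with a ZipInputStream; each call to getNextEntry() positions the stream at the start of the next entry.
  • For each entry, it checks the file extension (ignoring case) and selects a parser accordingly. Directories and entries with any other extension are skipped.
  • For text files, Apache Tika's AutoDetectParser is used to parse the content into a String.
  • For PDF files, the PDFParser is used to parse the body text into a String in the same way.
  • For DOCX files, the OOXMLParser is used to parse the content into a String.
  • Every entry is handed to the parser through a wrapper whose close() does nothing, so the shared ZipInputStream stays open for the remaining entries, and BodyContentHandler(-1) removes the default 100,000-character limit on extracted text.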
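
The entry loop and the extension check described above can be tried on their own, without Tika on the classpath. The following is a minimal sketch under stated assumptions: the class name ZipEntryWalk, its helper methods, and the sample entry names are hypothetical and only serve the illustration, and readAllBytes() (Java 9 or later) stands in for the point where a Tika parser would consume the entry. It also wraps each entry in a stream whose close() does nothing, a precaution so that a consumer closing its input does not close the whole archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntryWalk {

    /** Builds a small in-memory zip (hypothetical sample data) so no file on disk is needed. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            String[] names = {"notes.txt", "image.png", "REPORT.PDF", "docs/letter.docx"};
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** True for the three extensions the article handles, ignoring case. */
    public static boolean isSupported(String entryName) {
        String name = entryName.toLowerCase(Locale.ROOT);
        return name.endsWith(".txt") || name.endsWith(".pdf") || name.endsWith(".docx");
    }

    /** Walks the archive and returns "name=content" for each supported entry. */
    public static List<String> readSupportedEntries(byte[] zipBytes) throws IOException {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || !isSupported(entry.getName())) {
                    continue; // skip directories and unsupported file types
                }
                // Closing this wrapper leaves the underlying zip stream open,
                // so getNextEntry() keeps working for the remaining entries.
                try (InputStream entryStream = new FilterInputStream(zip) {
                    @Override
                    public void close() {
                        // intentionally empty
                    }
                }) {
                    // A Tika parser would read entryStream here; this sketch reads raw bytes.
                    String text = new String(entryStream.readAllBytes(), StandardCharsets.UTF_8);
                    result.add(entry.getName() + "=" + text);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        readSupportedEntries(sampleZip()).forEach(System.out::println);
    }
}
```

Reading an entry stops at the end of that entry's data, which is why the same ZipInputStream can be passed along once per entry. The sample content here is plain text for every entry, so the sketch shows only the iteration and filtering, not real PDF or DOCX parsing.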
