How Do I Read Content from Multiple File Types Within a Zip Archive Using Apache Tika?-javaTutorial-php.cn

Home

Java

javaTutorial

How Do I Read Content from Multiple File Types Within a Zip Archive Using Apache Tika?

Mary-Kate Olsen

Oct 28, 2024 pm 09:20 PM

How Do I Read Content from Multiple File Types Within a Zip Archive Using Apache Tika?

Reading Content from Files Within Zip Achieved with Apache Tika

Challenge:

You aspire to write a Java program that extracts and reads the content of multiple files within a zip archive using Apache Tika. Specifically, your zip file contains a mix of text, PDF, and docx files.

Solution:

public class ZipContentExtractor {

    public static void main(String[] args) throws IOException, SAXException, TikaException {
        File zipFile = new File("C:\Users\xxx\Desktop\abc.zip");

        try (ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(zipFile))) {
            ZipEntry entry;
            while ((entry = zipInputStream.getNextEntry()) != null) {
                // Checking file types
                if (entry.getName().endsWith(".txt") || entry.getName().endsWith(".pdf") || entry.getName().endsWith(".docx")) {
                    // Handling text files
                    if (entry.getName().endsWith(".txt")) {
                        BodyContentHandler textHandler = new BodyContentHandler();
                        Parser parser = new AutoDetectParser();
                        parser.parse(zipInputStream, textHandler, new Metadata(), new ParseContext());
                        System.out.println("TXT file content: " + textHandler.toString());
                    }
                    // Handling PDF files
                    else if (entry.getName().endsWith(".pdf")) {
                        Metadata metadata = new Metadata();
                        Parser parser = new PDFParser();
                        parser.parse(zipInputStream, new StreamingContentHandler(), metadata, new ParseContext());
                        System.out.println("PDF file content: " + metadata.get("xmpDM:documentID"));
                    }
                    // Handling DOCX files
                    else {
                        BodyContentHandler textHandler = new BodyContentHandler();
                        Parser parser = new OOXMLParser();
                        parser.parse(zipInputStream, textHandler, new Metadata(), new ParseContext());
                        System.out.println("DOCX file content: " + textHandler.toString());
                    }
                }
            }
        }
    }
}

Explanation:

The code iterates through the entries in the zip file.
For each entry, it checks the file type and handles it appropriately based on the file extension.
For text files, Apache Tika's AutoDetectParser is used to parse the content into a String.
For PDF files, the PDFParser is used to extract metadata, such as the document ID.
For DOCX files, the OOXMLParser is used to parse the content into a String.

The above is the detailed content of How Do I Read Content from Multiple File Types Within a Zip Archive Using Apache Tika?. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Java Platform Independence: What does it mean for developers?May 08, 2025 am 12:27 AM

Java'splatformindependencemeansdeveloperscanwritecodeonceandrunitonanydevicewithoutrecompiling.ThisisachievedthroughtheJavaVirtualMachine(JVM),whichtranslatesbytecodeintomachine-specificinstructions,allowinguniversalcompatibilityacrossplatforms.Howev

How to set up JVM for first usage?May 08, 2025 am 12:21 AM

To set up the JVM, you need to follow the following steps: 1) Download and install the JDK, 2) Set environment variables, 3) Verify the installation, 4) Set the IDE, 5) Test the runner program. Setting up a JVM is not just about making it work, it also involves optimizing memory allocation, garbage collection, performance tuning, and error handling to ensure optimal operation.

How can I check Java platform independence for my product?May 08, 2025 am 12:12 AM

ToensureJavaplatformindependence,followthesesteps:1)CompileandrunyourapplicationonmultipleplatformsusingdifferentOSandJVMversions.2)UtilizeCI/CDpipelineslikeJenkinsorGitHubActionsforautomatedcross-platformtesting.3)Usecross-platformtestingframeworkss

Java Features for Modern Development: A Practical OverviewMay 08, 2025 am 12:12 AM

Javastandsoutinmoderndevelopmentduetoitsrobustfeatureslikelambdaexpressions,streams,andenhancedconcurrencysupport.1)Lambdaexpressionssimplifyfunctionalprogramming,makingcodemoreconciseandreadable.2)Streamsenableefficientdataprocessingwithoperationsli

Mastering Java: Understanding Its Core Features and CapabilitiesMay 07, 2025 pm 06:49 PM

The core features of Java include platform independence, object-oriented design and a rich standard library. 1) Object-oriented design makes the code more flexible and maintainable through polymorphic features. 2) The garbage collection mechanism liberates the memory management burden of developers, but it needs to be optimized to avoid performance problems. 3) The standard library provides powerful tools from collections to networks, but data structures should be selected carefully to keep the code concise.

Can Java be run everywhere?May 07, 2025 pm 06:41 PM

Yes,Javacanruneverywhereduetoits"WriteOnce,RunAnywhere"philosophy.1)Javacodeiscompiledintoplatform-independentbytecode.2)TheJavaVirtualMachine(JVM)interpretsorcompilesthisbytecodeintomachine-specificinstructionsatruntime,allowingthesameJava

What is the difference between JDK and JVM?May 07, 2025 pm 05:21 PM

JDKincludestoolsfordevelopingandcompilingJavacode,whileJVMrunsthecompiledbytecode.1)JDKcontainsJRE,compiler,andutilities.2)JVMmanagesbytecodeexecutionandsupports"writeonce,runanywhere."3)UseJDKfordevelopmentandJREforrunningapplications.

Java features: a quick guideMay 07, 2025 pm 05:17 PM

Key features of Java include: 1) object-oriented design, 2) platform independence, 3) garbage collection mechanism, 4) rich libraries and frameworks, 5) concurrency support, 6) exception handling, 7) continuous evolution. These features of Java make it a powerful tool for developing efficient and maintainable software.

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

How to fix KB5055523 fails to install in Windows 11?

4 weeks agoByDDD

How to fix KB5055518 fails to install in Windows 10?

4 weeks agoByDDD

Roblox: Grow A Garden - Complete Mutation Guide

2 weeks agoByDDD

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

How to fix KB5055612 fails to install in Windows 10?

3 weeks agoByDDD

Hot Tools

SublimeText3 English version

Recommended: Win version, supports code prompts!

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.