Compression algorithms in Parquet Java

Mary-Kate Olsen

Jan 20, 2025 pm 06:04 PM

Apache Parquet is a columnar storage format targeted at analytical workloads, but it can be used to store any type of structured data, addressing a variety of use cases.

One of its most notable features is its ability to compress data efficiently by applying different compression techniques at two stages of the write process. This reduces storage costs and improves read performance.

This article explains Parquet’s file compression in Java, provides usage examples, and analyzes its performance.

Compression technology

Unlike traditional row-based storage formats, Parquet uses a columnar approach, allowing the use of more specific and efficient compression techniques based on locality and value redundancy of the same type of data.

Parquet writes information in binary format and applies compression at two different levels, each using a different technique:

When writing the values of a column, it adaptively chooses the encoding according to the characteristics of the first values: dictionary encoding, run-length encoding, bit packing, delta encoding, etc.
Whenever a certain number of bytes is reached (1 MB by default), a page is formed and the binary block is compressed with an algorithm configurable by the programmer (no compression, GZIP, Snappy, LZ4, ZSTD, etc.).
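To make the first level more concrete, here is a minimal sketch of two of the encodings named above, dictionary encoding followed by run-length encoding. It is only an illustration of the idea: the class and method names are hypothetical, and parquet-java's real encoders work on binary pages and are considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EncodingSketch {

    // Dictionary encoding: keep each distinct value once and replace values with small integer ids.
    static List<Integer> dictionaryEncode(List<String> values, Map<String, Integer> dictionary) {
        List<Integer> ids = new ArrayList<>();
        for (String value : values) {
            Integer id = dictionary.get(value);
            if (id == null) {
                id = dictionary.size();
                dictionary.put(value, id);
            }
            ids.add(id);
        }
        return ids;
    }

    // Run-length encoding: collapse consecutive repeats into {value, count} pairs.
    static List<int[]> runLengthEncode(List<Integer> ids) {
        List<int[]> runs = new ArrayList<>();
        for (int id : ids) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == id) {
                last[1]++;
            } else {
                runs.add(new int[] {id, 1});
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Hypothetical column with few distinct, often repeated values.
        List<String> column = List.of("ES", "ES", "ES", "IT", "IT", "ES");
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<Integer> ids = dictionaryEncode(column, dictionary);
        System.out.println("dictionary = " + dictionary);
        System.out.println("ids = " + ids);
        for (int[] run : runLengthEncode(ids)) {
            System.out.println("id " + run[0] + " repeated " + run[1] + " time(s)");
        }
    }
}
```

Columns with few distinct values and long runs benefit most from this kind of encoding, which is why the encoding is chosen per column.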

Although the compression algorithm is configured at the file level, the encoding of each column is automatically selected using an internal heuristic (at least in the parquet-java implementation).

The performance of different compression technologies depends heavily on your data, so there is no one-size-fits-all solution that guarantees the fastest processing time and lowest storage consumption. You need to perform your own tests.

Code

Configuration is simple and only requires explicit setting when writing. When reading a file, Parquet discovers which compression algorithm is used and applies the corresponding decompression algorithm.

Configure algorithm or codec

In Carpet, and in Parquet with Avro or Protocol Buffers, you configure the compression algorithm by calling the builder's withCompressionCodec method:

Carpet

CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .build();

Avro

ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
    .withSchema(new Organization().getSchema())
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .build();

Protocol Buffers

ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
    .withMessage(Organization.class)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .build();

The value must be one of the values available in the CompressionCodecName enumeration: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD, and LZ4_RAW (LZ4 is deprecated, LZ4_RAW should be used).

Compression level

Some compression algorithms provide a way to fine-tune the compression level. This level is usually related to how much effort they need to put into finding repeating patterns; the higher the compression level, the more time and memory the compression process requires.
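The effect of the level can be observed outside Parquet with the JDK's own zlib binding. The sketch below compresses the same synthetic data with java.util.zip.Deflater at levels 1, 6 and 9 and prints the resulting size and elapsed time. The data and class name are made up for illustration, and the numbers will differ from what a Parquet codec produces on real pages.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class LevelSketch {

    // Compress the input with the given zlib level (0-9) and return the compressed size in bytes.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Synthetic, repetitive data standing in for a page of column values.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50_000; i++) {
            sb.append("trip,").append(i % 97).append(",NYC;");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        for (int level : new int[] {1, 6, 9}) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n", level, data.length, size, micros);
        }
    }
}
```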

Although they come with default values, they can be modified using Parquet's generic configuration mechanism, albeit using different keys for each codec.

Additionally, the values to choose are not standard and depend on each codec, so you must refer to the documentation for each algorithm to understand what each level offers.

ZSTD

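The ZSTD codec declares a constant for the level configuration key: ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL.

Possible values range from 1 to 22, and the default value is 3.

CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    // Assumes CarpetWriter.Builder exposes config(String, String), as ParquetWriter.Builder does
    .config(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "6")
    .build();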

LZO

The LZO codec declares a constant for the level configuration key: LzoCodec.LZO_COMPRESSION_LEVEL_KEY.

Possible values are 1 to 9, 99 and 999, and the default value is 999.

ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
    .withSchema(new Organization().getSchema())
    .withCompressionCodec(CompressionCodecName.LZO)
    .config(LzoCodec.LZO_COMPRESSION_LEVEL_KEY, "9")
    .build();

GZIP

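GZIP does not declare any constant, so you must use the string "zlib.compress.level" directly. Possible levels range from 0 to 9, and the default is 6.

ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
    .withMessage(Organization.class)
    .withCompressionCodec(CompressionCodecName.GZIP)
    // Assumption: Hadoop reads this key as a CompressionLevel name (BEST_COMPRESSION = level 9); verify for your version
    .config("zlib.compress.level", "BEST_COMPRESSION")
    .build();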

Performance Test

To analyze the performance of different compression algorithms, I will use two public datasets containing different types of data:

New York City Taxi Trip: Contains a large number of numeric and a small number of string values in several columns. It has 23 columns and contains 19.6 million records.
Cohesion Project of the Italian Government: Many columns contain floating point values as well as a large number of various text strings. It has 91 columns and contains 2 million rows.

I will evaluate some of the compression algorithms enabled in Parquet Java: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD, LZ4_RAW.

As you might expect, I will use Carpet with the default configuration provided by parquet-java and the default compression level of each algorithm.

You can find the source code on GitHub. Testing was done on a laptop with an AMD Ryzen 7 4800HS CPU and JDK 17.

File size

To understand the performance of each compression, we will use the equivalent CSV file as a reference.

Format	gov.it	NYC Taxi
CSV	1761 MB	2983 MB
UNCOMPRESSED	564 MB	760 MB
SNAPPY	220 MB	542 MB
GZIP	146 MB	448 MB
ZSTD	148 MB	430 MB
LZ4_RAW	209 MB	547 MB
LZO	215 MB	518 MB

In both tests, GZIP and Zstandard achieved the best compression.

Using only Parquet's encodings, the file size drops to 25%-32% of the original CSV size. With compression applied on top, the best codecs bring it down to roughly 8%-15% of the CSV size.
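These percentages follow directly from the file size table. As a small worked example, the sketch below recomputes each size as a percentage of the CSV reference. The class and method names are hypothetical, and the numbers are copied from the table above.

```java
public class SizeRatio {

    // Size of a file expressed as a percentage of the reference CSV size.
    static double percentOfCsv(double fileMb, double csvMb) {
        return 100.0 * fileMb / csvMb;
    }

    public static void main(String[] args) {
        // Values in MB, copied from the file size table.
        double csvGov = 1761;
        double csvTaxi = 2983;
        String[] formats = {"UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4_RAW", "LZO"};
        double[] gov = {564, 220, 146, 148, 209, 215};
        double[] taxi = {760, 542, 448, 430, 547, 518};

        for (int i = 0; i < formats.length; i++) {
            System.out.printf("%-12s gov.it %.1f%%  NYC Taxi %.1f%%%n",
                    formats[i], percentOfCsv(gov[i], csvGov), percentOfCsv(taxi[i], csvTaxi));
        }
    }
}
```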

Write

How much overhead does compressing information bring?

Writing the same data three times and averaging the elapsed time in seconds gives:

Algorithm	gov.it	NYC Taxi
UNCOMPRESSED	25.0	57.9
SNAPPY	25.2	56.4
GZIP	39.3	91.1
ZSTD	27.3	64.1
LZ4_RAW	24.9	56.5
LZO	26.0	56.1

SNAPPY, LZ4_RAW and LZO achieve times similar to no compression, while ZSTD adds around 10% overhead. GZIP performed worst, with write times more than 50% slower.
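The measurement procedure described above (run the same write several times and average the elapsed seconds) can be sketched with a small helper. This is not the benchmark code from the GitHub repository. The class name is hypothetical and the workload is a placeholder where the real test would write a Parquet file.

```java
public class WriteTiming {

    // Run the task the given number of times and return the mean wall-clock time in seconds.
    static double averageSeconds(Runnable task, int runs) {
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e9 / runs;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would write the dataset with a given codec here.
        Runnable writeFile = () -> {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };
        System.out.printf("average: %.3f s%n", averageSeconds(writeFile, 3));
    }
}
```

Three runs is a small sample, so JIT warm-up and disk caching can still skew the result. A harness such as JMH is a better fit when precise numbers matter.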

Read

Reading a file is faster than writing because less computation is required.

The time in seconds to read all columns in the file is:

Algorithm	gov.it	NYC Taxi
UNCOMPRESSED	11.4	37.4
SNAPPY	12.5	39.9
GZIP	13.6	40.9
ZSTD	13.1	41.5
LZ4_RAW	12.8	41.6
LZO	13.1	41.1

Read times are close to those of uncompressed data, with a decompression overhead of roughly 7% to 20%.
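The overhead figures can be derived from the read table by comparing each codec with the uncompressed baseline. A minimal sketch, with a hypothetical class name and the values copied from the table above:

```java
public class ReadOverhead {

    // Extra time relative to the uncompressed baseline, as a percentage.
    static double overheadPercent(double seconds, double baselineSeconds) {
        return 100.0 * (seconds - baselineSeconds) / baselineSeconds;
    }

    public static void main(String[] args) {
        // Read times in seconds, copied from the read table.
        double baselineGov = 11.4;
        double baselineTaxi = 37.4;
        String[] codecs = {"SNAPPY", "GZIP", "ZSTD", "LZ4_RAW", "LZO"};
        double[] gov = {12.5, 13.6, 13.1, 12.8, 13.1};
        double[] taxi = {39.9, 40.9, 41.5, 41.6, 41.1};

        for (int i = 0; i < codecs.length; i++) {
            System.out.printf("%-8s gov.it +%.1f%%  NYC Taxi +%.1f%%%n",
                    codecs[i],
                    overheadPercent(gov[i], baselineGov),
                    overheadPercent(taxi[i], baselineTaxi));
        }
    }
}
```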

Conclusion

No algorithm is significantly better than the others in read and write times; all fall within a similar range. In most cases, the savings in storage space (and transfer time) make up for the time spent compressing.

In these two use cases, the deciding factor in choosing one algorithm or another is probably the compression ratio achieved, where ZSTD and GZIP stand out (GZIP at the cost of noticeably slower writes).

Each algorithm has its advantages, so the best option is to test it with your data and consider which factor is more important:

Minimize storage usage, because you store large amounts of rarely accessed data.
Minimize file generation time.
Minimize read time, because files are read many times.

Like everything in life, it is a trade-off, and you have to decide what matters most in your case. If you don't configure anything, Carpet uses Snappy by default.

Implementation details

As mentioned, the codec must be one of the values of the CompressionCodecName enumeration. Each enumeration value carries the name of the class that implements the algorithm.

Parquet uses reflection to instantiate the specified class, which must implement the CompressionCodec interface. If you look at its source code, you'll see that this interface lives in the Hadoop project, not in Parquet. This shows how tightly Parquet is coupled to Hadoop in its Java implementation.

To use one of these codecs, you must ensure that you have added the JAR containing its implementation as a dependency.

Not all implementations are present in the transitive dependencies you have when adding parquet-java, or you may be excluding Hadoop dependencies too aggressively.

The org.apache.parquet:parquet-hadoop dependency includes the implementations of SnappyCodec, ZstandardCodec, and Lz4RawCodec, and transitively imports snappy-java, zstd-jni, and aircompressor, which contain the actual implementations of these three algorithms.

The org.apache.hadoop:hadoop-common dependency contains the implementation of GzipCodec.

Where are the implementations of BrotliCodec and LzoCodec? They are not in any Parquet or Hadoop dependencies, so if you use them without adding additional dependencies, your application will not be able to use files compressed in those formats.
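Because the codec classes are loaded by name, a quick way to find out which ones your application can actually use is to try loading them. The sketch below does that with plain reflection. The fully qualified class names are my understanding of what CompressionCodecName declares and should be treated as assumptions, so check them against the parquet-java version you use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CodecClasspathCheck {

    // Returns true if the class can be found on the classpath, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, CodecClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names; verify them in CompressionCodecName for your version.
        Map<String, String> codecClasses = new LinkedHashMap<>();
        codecClasses.put("SNAPPY", "org.apache.parquet.hadoop.codec.SnappyCodec");
        codecClasses.put("GZIP", "org.apache.hadoop.io.compress.GzipCodec");
        codecClasses.put("LZO", "com.hadoop.compression.lzo.LzoCodec");
        codecClasses.put("BROTLI", "org.apache.hadoop.io.compress.BrotliCodec");
        codecClasses.put("ZSTD", "org.apache.parquet.hadoop.codec.ZstandardCodec");
        codecClasses.put("LZ4_RAW", "org.apache.parquet.hadoop.codec.Lz4RawCodec");

        for (Map.Entry<String, String> entry : codecClasses.entrySet()) {
            String status = isOnClasspath(entry.getValue()) ? "available" : "missing";
            System.out.println(entry.getKey() + ": " + status + " (" + entry.getValue() + ")");
        }
    }
}
```

Running a check like this at startup or in a test gives an early, readable failure instead of an error deep inside the first read or write.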

To support LZO, you need to add the dependency org.anarres.lzo:lzo-hadoop to your pom or gradle file.
The situation with Brotli is more complicated: the dependency is not in Maven Central and you must also add the JitPack repository.
