
Parquet File Format - Explained


You may never have heard of the Apache Parquet file format. Like CSV, Parquet is a file format for storing tabular data.

Parquet is a free and open-source file format available to any project in the Hadoop ecosystem. Apache Parquet is designed as an efficient, high-performance, flat columnar storage format, in contrast to row-based formats such as CSV or TSV. It provides efficient data compression and encoding schemes with enhanced performance for handling complex data in bulk. This approach is especially good for queries that need to read only certain columns from a large table: Parquet can read just the needed columns, greatly reducing I/O.
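
To make the column-pruning benefit concrete, here is a minimal sketch in pandas (the file name and column names are hypothetical):

import pandas as pd

# Only the two requested columns are read from disk; a CSV reader
# would have to scan every full row instead.
df = pd.read_parquet("events.parquet", columns=["user_id", "timestamp"])
print(df.head())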


Features of Parquet

  • Apache Parquet is column-oriented and designed to provide efficient columnar storage of data (blocks, row groups, column chunks…) compared to row-based formats like CSV.

  • Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex nested data structures that can be used to store the data.

  • Column-wise compression is efficient and saves storage space

  • Apache Parquet lowers storage costs for data files and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum, BigQuery, and Azure Data Lake.

  • Different encoding techniques can be applied to different columns (see the sketch after this list).

  • Apache Parquet works with different programming languages such as C++, Java, and Python.

  • Supports familiar data types, file-level metadata, and automatic dictionary encoding.
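
As a sketch of the per-column controls mentioned above (assuming pyarrow is installed; the file and column names are made up):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "city": ["Paris", "Paris", "Tokyo"],  # low cardinality: dictionary-encodes well
    "reading": [21.3, 19.8, 25.1],
})

# Dictionary-encode only the "city" column, and pick a different
# compression codec for each column.
pq.write_table(
    table,
    "readings.parquet",
    use_dictionary=["city"],
    compression={"city": "snappy", "reading": "gzip"},
)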


Modules

  • The parquet-format project contains format specifications and Thrift definitions of metadata required to properly read Parquet files.

  • The parquet-mr project contains multiple sub-modules, which implement the core components of reading and writing a nested, column-oriented data stream, map this core onto the parquet format, and provide Hadoop Input/Output Formats, Pig loaders, and other Java-based utilities for interacting with Parquet.

  • The parquet-cpp project is a C++ library to read and write Parquet files.

  • The parquet-rs project is a Rust library to read and write Parquet files.

  • The parquet-compatibility project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other’s files.


Why Parquet?


Parquet, when compared to a format like CSV, offers compelling benefits in terms of cost, efficiency, and flexibility. By converting CSV data into Parquet's columnar format, then compressing and partitioning it, we can save on storage and achieve better query performance.
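
For example, a CSV file can be converted to compressed, partitioned Parquet in a few lines of pandas (a sketch; the file, directory, and column names are hypothetical):

import pandas as pd

df = pd.read_csv("sales.csv")

# Snappy-compressed Parquet, partitioned by year so that queries
# filtering on year only touch the matching directories.
df.to_parquet(
    "sales_parquet/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["year"],
)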


File format

The format specification and the Thrift definitions of the metadata should be read together to understand the format.


[Figure: Parquet file layout with N columns and M row groups (image: netjstech)]

In the example above, the table has N columns, split into M row groups. The file metadata contains the start locations of all the column metadata. More details on what the metadata contains can be found in the Thrift files.
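
You can inspect this structure programmatically. A minimal sketch with pyarrow (the file name is hypothetical):

import pyarrow.parquet as pq

meta = pq.ParquetFile("readings.parquet").metadata
print(meta.num_row_groups, "row groups,", meta.num_columns, "columns")

# Column-chunk metadata for the first column of the first row group.
col = meta.row_group(0).column(0)
print(col.compression, col.total_compressed_size)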




Pandas implements a Parquet reading interface:

pandas.read_parquet(path, engine='auto', columns=None, use_nullable_dtypes=False, **kwargs)


Parameters Explained

path - The file path. Any valid string path is acceptable. The string could also be a URL, or a path to a directory that contains multiple partitioned Parquet files. Both pyarrow and fastparquet support directory paths as well as file URLs.

engine - By default set to 'auto'. The default behavior is to try 'pyarrow', falling back to 'fastparquet' if 'pyarrow' is unavailable.

columns - If not None, only the named columns will be read from the file. Pass a list of column names.

use_nullable_dtypes - If True, use dtypes that use pd.NA as the missing-value indicator for the resulting DataFrame (only applicable for engine="pyarrow"). As new dtypes that support pd.NA are added in the future, the output with this option will change to use them.

**kwargs - Any additional keyword arguments are passed to the engine.
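
Putting the parameters together, a typical call looks like this (a sketch; the directory and column names are hypothetical):

import pandas as pd

df = pd.read_parquet(
    "sales_parquet/",              # a directory of partitioned files
    engine="pyarrow",
    columns=["year", "amount"],    # read only these columns
    use_nullable_dtypes=True,      # missing values become pd.NA
)
print(df.dtypes)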


Summary

Hopefully, the above details will help you achieve better performance with Parquet files.

References

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html

