Is a gzipped Parquet file splittable in HDFS for Spark?

Question

I get conflicting messages when searching and reading answers online on this subject. Can anyone share their experience? I know that gzipped CSV is not splittable, but perhaps the internal file structure of Parquet makes this a completely different case for Parquet vs. CSV?

+6

gzip apache-spark parquet

YuGagarin Apr 10 '17 at 13:01

1 answer

xhochy · Answer 1 · 2017-04-13T11:20:21+0000

Parquet files with GZIP compression are actually splittable. This is because of the internal layout of Parquet files. They are always splittable, regardless of the compression algorithm used.

This is mainly due to the design of Parquet files, which are divided into the following parts:

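Each Parquet file consists of several RowGroups, which should ideally be the same size as your HDFS block size.
Each RowGroup consists of one ColumnChunk per column. Every ColumnChunk in a RowGroup has the same number of rows.
ColumnChunks are split into Pages, typically somewhere between 64KiB and 16MiB in size. Compression is applied per page, so a page is the lowest level of parallelization a job can work on.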
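
The layout described above can be illustrated with a toy model. This is only a sketch of the general idea (independently compressed units plus an index of offsets), not Parquet's real on-disk format. The helper names `write_chunked` and `read_chunk` are hypothetical, and `zlib` (the DEFLATE codec that gzip is built on) stands in for the GZIP codec.

```python
import zlib


def write_chunked(chunks):
    """Compress each chunk on its own and record (offset, length) per chunk.

    Hypothetical helper: the index plays the role that Parquet's footer
    metadata plays, but this is not the real Parquet layout.
    """
    body = bytearray()
    index = []
    for raw in chunks:
        compressed = zlib.compress(raw)
        index.append((len(body), len(compressed)))
        body += compressed
    return bytes(body), index


def read_chunk(body, index, i):
    """Decompress chunk i only, without touching the chunks before it."""
    offset, length = index[i]
    return zlib.decompress(body[offset:offset + length])


chunks = [
    b"row group 0 " * 1000,
    b"row group 1 " * 1000,
    b"row group 2 " * 1000,
]
body, index = write_chunked(chunks)

# A reader that was assigned only the last chunk can jump straight to it.
assert read_chunk(body, index, 2) == chunks[2]

# Contrast: one compressed stream over the whole payload (like a gzipped CSV)
# cannot be picked up from the middle.
whole = zlib.compress(b"".join(chunks))
try:
    zlib.decompress(whole[len(whole) // 2:])
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False
```

Because every unit is compressed independently and its position is recorded, separate readers can each handle a different unit in parallel. A single compressed stream has to be read from the start.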

For a more detailed explanation, see: https://github.com/apache/parquet-format#file-format
