Is a gzipped Parquet file splittable in HDFS for Spark?

I get conflicting information when searching and reading answers online on this subject. Can anyone share their experience? I know a gzipped CSV is not splittable, but perhaps the internal structure of Parquet files makes this a completely different case from CSV?

1 answer

Parquet files with GZIP compression are in fact splittable. This is because of the internal layout of Parquet files: they are always splittable, regardless of the compression algorithm used.

This follows from the design of Parquet files, which are divided into the following parts:

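  • Each Parquet file consists of several RowGroups, which should be about the same size as your HDFS block size.
  • Each RowGroup consists of one ColumnChunk per column. Every ColumnChunk in a RowGroup has the same number of rows.
  • ColumnChunks are split into pages, typically somewhere between 64KiB and 16MiB in size. Compression is applied per page, so a page is the smallest unit of parallelism a job can work on.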

For a more detailed explanation, see: https://github.com/apache/parquet-format#file-format
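
To make the per-page compression point concrete, here is a minimal sketch in plain Python. It is a toy model, not the Parquet format: the helper names `write_pages` and `read_page`, the newline-joined row encoding, and the in-memory index are all assumptions made for illustration. It uses `zlib`, which implements the same DEFLATE algorithm that gzip is built on. Each page is compressed on its own, so a reader that knows a page's offset and length can decompress it without reading anything before it.

```python
import zlib


# Toy model only: this is NOT the Parquet file format, just the same idea
# of compressing small pages independently and keeping an index of them.
def write_pages(rows, rows_per_page):
    """Compress each page of rows separately; record (offset, length) per page."""
    blob = bytearray()
    index = []
    for start in range(0, len(rows), rows_per_page):
        page = "\n".join(rows[start:start + rows_per_page]).encode("utf-8")
        compressed = zlib.compress(page)
        index.append((len(blob), len(compressed)))
        blob.extend(compressed)
    return bytes(blob), index


def read_page(blob, index, page_no):
    """Decompress a single page without touching the bytes of any other page."""
    offset, length = index[page_no]
    data = zlib.decompress(blob[offset:offset + length])
    return data.decode("utf-8").split("\n")


rows = [f"row-{i}" for i in range(10)]
blob, index = write_pages(rows, rows_per_page=4)

# A reader can jump straight to the last page using only the index.
last_page = read_page(blob, index, 2)
```

A gzipped CSV, by contrast, is one continuous compressed stream with no such index, so a reader has to start decompressing from the beginning of the file. That is the difference the question asks about.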


Source: https://habr.com/ru/post/1016441/

