Parquet files compressed with GZIP are in fact splittable. This follows from the internal layout of Parquet files: they are always splittable, regardless of the compression algorithm used.
A Parquet file is divided into the following parts:
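- Each Parquet file consists of several RowGroups, which should be roughly the same size as your HDFS block size.
- Each RowGroup consists of one ColumnChunk per column. Every ColumnChunk in a RowGroup has the same number of rows.
- ColumnChunks are split into pages, which are typically between 64KiB and 16MiB in size. Compression is applied per page, so a page is the smallest unit a job can process in parallel.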
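
To illustrate why per-page compression keeps a file splittable, here is a minimal sketch using only Python's standard library. It does not use the real Parquet format: the toy layout, the `(offset, length)` index, and the helper names `write_toy_file`, `read_page`, and `can_decompress_from_middle` are all assumptions made up for this example.

```python
import gzip
import zlib


def write_toy_file(pages):
    """Gzip each page independently and record (offset, length) per page.

    Toy stand-in for a Parquet-like layout, not the actual file format.
    """
    body = bytearray()
    index = []
    for page in pages:
        blob = gzip.compress(page)
        index.append((len(body), len(blob)))
        body.extend(blob)
    return bytes(body), index


def read_page(body, index, i):
    """Decompress page i alone, without reading any other page."""
    offset, length = index[i]
    return gzip.decompress(body[offset:offset + length])


def can_decompress_from_middle(blob):
    """Try to start decompressing a single gzip stream at its midpoint."""
    try:
        gzip.decompress(blob[len(blob) // 2:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


pages = [f"row-{i}".encode() * 1000 for i in range(4)]
body, index = write_toy_file(pages)

# A worker can pick up any page on its own.
third_page = read_page(body, index, 2)

# Contrast: one gzip stream over the whole payload has no usable entry
# point in the middle, which is why a plain .gz file is not splittable.
whole = gzip.compress(b"".join(pages))
middle_ok = can_decompress_from_middle(whole)
```

Because every page is a self-contained gzip stream and its position is known from the index, different workers can decompress different pages at the same time. In a real Parquet file the footer metadata plays the role of this index.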
You can find a more detailed explanation here: https://github.com/apache/parquet-format#file-format