If I'm not mistaken, Spark uses Hadoop's TextInputFormat when you read a file with SparkContext.textFile. If a compression codec is set, TextInputFormat determines whether the file is splittable by checking whether the codec is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
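To illustrate why a GZIP stream cannot be split, here is a minimal sketch using only the Python standard library (this is not Spark code; the sample data and the helper name can_decode_from are made up for illustration). Decoding works from the start of the stream, but a reader dropped at an arbitrary byte offset has no header and no preceding context to resume from, which is roughly the situation a second partition would be in.

```python
import gzip
import zlib

# Illustrative data only: 10,000 short text records compressed into one gzip blob.
lines = "".join(f"record {i}\n" for i in range(10_000)).encode("utf-8")
blob = gzip.compress(lines)


def can_decode_from(data: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # Raised when the bytes at `offset` are not a valid gzip stream.
        return False


start_ok = can_decode_from(blob, 0)                # reading from the beginning
middle_ok = can_decode_from(blob, len(blob) // 2)  # reading from the middle
print(start_ok, middle_ok)
```

A splittable codec is one whose format lets a reader find a safe starting point partway through the file; plain GZIP does not provide that.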
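What you could do:
1. Repartition after SparkContext.textFile, so that at least the later transformations run on more than one partition.
2. Ask for multiple smaller files instead of a single GZIP file.
3. Decompress the GZIP file before reading it, save the result, and have Spark read that.
4. Write your own splittable compression codec for GZIP (this is more involved).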
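Two ways around the single-partition limit are to decompress the file before Spark reads it, or to rewrite it as several smaller GZIP files. A rough sketch of both, using only the Python standard library, could look like the following. The function names, the part-NNNNN.gz naming, and the chunk size are my own assumptions, not anything Spark requires.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def decompress_gzip(src: str, dst: str) -> None:
    """Stream-decompress `src` into `dst` without loading it all into memory."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def split_gzip(src: str, out_dir: str, lines_per_chunk: int) -> List[str]:
    """Rewrite one gzip file as several gzip files, cutting only at line ends."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[str] = []
    writer = None
    with gzip.open(src, "rb") as fin:
        for i, line in enumerate(fin):
            if i % lines_per_chunk == 0:
                # Start a new chunk file (hypothetical naming scheme).
                if writer is not None:
                    writer.close()
                path = out / f"part-{len(paths):05d}.gz"
                writer = gzip.open(path, "wb")
                paths.append(str(path))
            writer.write(line)
    if writer is not None:
        writer.close()
    return paths
```

After decompress_gzip, the plain-text output is splittable, so SparkContext.textFile can read it with several partitions. After split_gzip, you can point textFile at the directory or a glob such as chunks/*.gz (a hypothetical path); each chunk is still unsplittable on its own, but each file can be read by a separate task.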
Relevant classes to look at:
TextInputFormat
GzipCodec
These are Java classes, but the same behavior applies when you use Spark from Python/Scala.