If I'm not mistaken, Spark uses Hadoop's TextInputFormat when you read a file via SparkContext.textFile. If a compression codec is configured, TextInputFormat determines whether the file is splittable by checking whether the codec is an instance of SplittableCompressionCodec. I believe the GZIP codec is not splittable, so Spark can only generate a single partition to read the entire file.
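To see why a GZIP stream cannot be split, here is a small plain-Python sketch (no Spark required; the in-memory blob stands in for a `.gz` file on disk): decompression only works from the start of the stream, so an input split that begins in the middle of the file has nothing valid to decode.

```python
import gzip

# Build a gzipped payload in memory (stand-in for a .gz file on disk).
data = b"\n".join(b"record %d" % i for i in range(10_000))
blob = gzip.compress(data)

# Reading from the beginning works: one reader can decompress everything.
assert gzip.decompress(blob) == data

# A reader handed an input split starting mid-file cannot decode it:
# the DEFLATE stream has no record boundaries to resynchronize on.
try:
    gzip.decompress(blob[len(blob) // 2:])
    splittable = True
except OSError:  # gzip.BadGzipFile is a subclass of OSError
    splittable = False
assert not splittable
```

This is exactly the property SplittableCompressionCodec abstracts: a splittable codec (such as bzip2's) can restart decompression at block boundaries, while GZIP cannot.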
In other words:

1. If you read a gzipped file via SparkContext.textFile, decompression works, but the resulting RDD has only a single partition.
2. GZIP is not a splittable compression format.
3. The entire file is therefore read by a single task, although subsequent processing can still be parallelized by Spark (for example after repartitioning).
4. For large inputs, avoid GZIP (or decompress the file before handing it to Spark).
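As a rough illustration of points 3 and 4, the following plain-Python sketch (no Spark involved; the helper name `total_length` is made up) shows the pattern: one sequential pass decompresses the file, after which the records can be partitioned and processed in parallel, much like calling repartition() on the single-partition RDD.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a gzipped input file, built in memory.
records = ["record %d" % i for i in range(1_000)]
blob = gzip.compress("\n".join(records).encode())

# Point 3: the read itself is sequential -- a single "task" must
# decompress the whole stream before anything else can happen.
lines = gzip.decompress(blob).decode().splitlines()

# Point 4 (workaround): once decompressed, the records can be split
# into partitions and processed in parallel, analogous to repartition().
def total_length(partition):  # hypothetical per-partition work
    return sum(len(line) for line in partition)

n_partitions = 4
partitions = [lines[i::n_partitions] for i in range(n_partitions)]
with ThreadPoolExecutor(max_workers=n_partitions) as pool:
    result = sum(pool.map(total_length, partitions))

assert result == sum(len(r) for r in records)
```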
For reference, see the source of:

- TextInputFormat
- GzipCodec

These classes are from the Java API, but the same behavior applies when you use Spark from Python/Scala.