Import a large text file into Spark

I have a pipe-delimited text file that is 360 GB, compressed (gzip). The file is in an S3 bucket. This is my first time using Spark. I understand that you can partition a file so that multiple worker nodes can operate on the data, which results in huge performance gains. However, I am trying to find an efficient way to turn my single 360 GB file into a partitioned file. Is there a way to use multiple Spark worker nodes to work on my one compressed file and split it up? Unfortunately, I have no control over the fact that I just get one huge file. I could uncompress the file myself and break it into many files (say, 360 files of 1 GB each), but I would be using only one machine for that and it would be pretty slow. I need to run some expensive transformations on the data using Spark, so I think partitioning the file is necessary. I am using Spark inside Amazon Glue, so I know it can scale to a large number of machines. Also, I am using Python (pyspark).

Thanks.

+4
3 answers

If I'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, TextInputFormat determines whether the file is splittable by checking whether the codec is an instance of SplittableCompressionCodec.

I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
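You can see this directly from PySpark; a minimal sketch, where the bucket and file name are placeholders rather than anything from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-partition-check").getOrCreate()
sc = spark.sparkContext

# gzip is not a splittable codec, so no matter how large the file is,
# Spark creates a single partition for it.
rdd = sc.textFile("s3://your-bucket/huge-file.txt.gz")
print(rdd.getNumPartitions())  # prints 1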

What you could do is:
1. Read the file with SparkContext.textFile anyway; the read itself runs in a single task, but you can repartition the resulting RDD right after loading so that the expensive transformations run in parallel (see the sketch after this list).
2. Decompress the GZIP file to an uncompressed format before handing it to Spark.
3. Split the decompressed data into many smaller files, which Spark can then read in parallel.
4. Recompress the data with a splittable codec such as BZip2 instead of GZIP (splittable, but slower to compress and decompress).
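A minimal sketch of the first option, assuming a placeholder S3 path and an arbitrary partition count of 400:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-gzip-input").getOrCreate()
sc = spark.sparkContext

# The gzip file is still read by a single task...
raw = sc.textFile("s3://your-bucket/huge-file.txt.gz")

# ...but once loaded, repartitioning spreads the records across the cluster
# so that the expensive transformations run in parallel.
lines = raw.repartition(400)

# Example transformation: split the pipe-delimited lines into fields.
records = lines.map(lambda line: line.split("|"))
print(records.getNumPartitions())  # 400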

See the documentation for TextInputFormat and GzipCodec for the details.

Note that these are the Java class names, but the same behavior applies when you use Spark from Python/Scala.

+2

If you convert the data to ORC with zlib compression you get roughly 70% compression, and ORC files are splittable, so Spark can process them in parallel. You can write the ORC files like this:

# Write as ORC with zlib compression; the output directory name is arbitrary
df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("testoutput.orc")
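For context, a hedged sketch of the full flow under that approach; the S3 paths and the use of the CSV reader for the pipe-delimited file are assumptions, not from the answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-to-orc").getOrCreate()

# The initial read of the gzip file is still a single-partition read
# (gzip is not splittable); this one-time conversion removes the
# bottleneck for every later job.
df = spark.read.option("sep", "|").csv("s3://your-bucket/huge-file.txt.gz")

df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("s3://your-bucket/converted-orc/")

# Later jobs read the ORC output, which is splittable and processed in parallel.
orc_df = spark.read.orc("s3://your-bucket/converted-orc/")
print(orc_df.rdd.getNumPartitions())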
+1

Amazon provides S3DistCp, a tool for copying large amounts of data between S3 and the HDFS of an EMR cluster, so you could use it to copy the 360 GB file into HDFS (it ships with the Hadoop installation on EMR).

S3DistCp also supports changing the compression during the copy, so it can decompress the gzip file for you, after which you can load the result into a Spark RDD.

Note that S3DistCp runs as a Hadoop Map/Reduce job over S3; EMR's S3 access is optimized, so the copy (and decompression) is distributed across the cluster instead of being done by a single machine, as it would be with a plain Hadoop copy or a Spark map.
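A hedged sketch of that workflow; the s3-dist-cp flags are the commonly documented ones and all paths are placeholders, so verify them against your EMR release:

# Step 1 (run on the EMR cluster, outside Spark): copy and decompress with
# S3DistCp, e.g. something along the lines of
#   s3-dist-cp --src s3://your-bucket/huge-file.txt.gz --dest hdfs:///data/huge-file/ --outputCodec none
# (flags assumed from the EMR documentation; adjust for your cluster).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-distcp-output").getOrCreate()
sc = spark.sparkContext

# Step 2: the decompressed copy in HDFS is plain text, so Spark splits it
# into many partitions (roughly one per HDFS block) and processes it in parallel.
rdd = sc.textFile("hdfs:///data/huge-file/")
print(rdd.getNumPartitions())  # many partitions, not 1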

+1


