Import a large text file into Spark

I have a pipe-delimited text file that is 360 GB, compressed (gzip). The file is in an S3 bucket. This is my first time using Spark. I understand that you can partition a file so that multiple worker nodes can operate on the data, which results in huge performance gains. However, I am trying to find an efficient way to turn my single 360 GB file into a partitioned file. Is there a way to use multiple Spark worker nodes to work on my one compressed file and split it up? Unfortunately, I have no control over the fact that I just get one huge file. I could uncompress the file myself and break it into many files (say, 360 files of 1 GB each), but I would be using only one machine for that and it would be pretty slow. I need to run some expensive transformations on the data using Spark, so I think partitioning the file is necessary. I am using Spark inside Amazon Glue, so I know it can scale to a large number of machines. Also, I am using Python (pyspark).

Thanks.

+4
3 answers

If I'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, TextInputFormat determines whether the file is splittable by checking whether the codec is an instance of SplittableCompressionCodec.

I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
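You can see this directly from PySpark; a minimal sketch, where the bucket and file name are placeholders rather than anything from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-partition-check").getOrCreate()
sc = spark.sparkContext

# gzip is not a splittable codec, so no matter how large the file is,
# Spark creates a single partition for it.
rdd = sc.textFile("s3://your-bucket/huge-file.txt.gz")
print(rdd.getNumPartitions())  # prints 1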

What you could do is:
1. Read the file with SparkContext.textFile anyway; the read itself runs in a single task, but you can repartition the resulting RDD right after loading so that the expensive transformations run in parallel (see the sketch after this list).
2. Decompress the GZIP file to an uncompressed format before handing it to Spark.
3. Split the decompressed data into many smaller files, which Spark can then read in parallel.
4. Recompress the data with a splittable codec such as BZip2 instead of GZIP (splittable, but slower to compress and decompress).
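A minimal sketch of the first option, assuming a placeholder S3 path and an arbitrary partition count of 400:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-gzip-input").getOrCreate()
sc = spark.sparkContext

# The gzip file is still read by a single task...
raw = sc.textFile("s3://your-bucket/huge-file.txt.gz")

# ...but once loaded, repartitioning spreads the records across the cluster
# so that the expensive transformations run in parallel.
lines = raw.repartition(400)

# Example transformation: split the pipe-delimited lines into fields.
records = lines.map(lambda line: line.split("|"))
print(records.getNumPartitions())  # 400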

See the documentation for TextInputFormat and GzipCodec for the details.

Note that these are the Java class names, but the same behavior applies when you use Spark from Python/Scala.

+2

If you convert the data to ORC with zlib compression you get roughly 70% compression, and ORC files are splittable, so Spark can process them in parallel. You can write the ORC files like this:

# Write as ORC with zlib compression; the output directory name is arbitrary
df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("testoutput.orc")
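For context, a hedged sketch of the full flow under that approach; the S3 paths and the use of the CSV reader for the pipe-delimited file are assumptions, not from the answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-to-orc").getOrCreate()

# The initial read of the gzip file is still a single-partition read
# (gzip is not splittable); this one-time conversion removes the
# bottleneck for every later job.
df = spark.read.option("sep", "|").csv("s3://your-bucket/huge-file.txt.gz")

df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("s3://your-bucket/converted-orc/")

# Later jobs read the ORC output, which is splittable and processed in parallel.
orc_df = spark.read.orc("s3://your-bucket/converted-orc/")
print(orc_df.rdd.getNumPartitions())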
+1

Amazon provides S3DistCp, a tool for copying large amounts of data between S3 and the HDFS of an EMR cluster, so you could use it to copy the 360 GB file into HDFS (it ships with the Hadoop installation on EMR).

S3DistCp also supports changing the compression during the copy, so it can decompress the gzip file for you, after which you can load the result into a Spark RDD.

Note that S3DistCp runs as a Hadoop Map/Reduce job over S3; EMR's S3 access is optimized, so the copy (and decompression) is distributed across the cluster instead of being done by a single machine, as it would be with a plain Hadoop copy or a Spark map.
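A hedged sketch of that workflow; the s3-dist-cp flags are the commonly documented ones and all paths are placeholders, so verify them against your EMR release:

# Step 1 (run on the EMR cluster, outside Spark): copy and decompress with
# S3DistCp, e.g. something along the lines of
#   s3-dist-cp --src s3://your-bucket/huge-file.txt.gz --dest hdfs:///data/huge-file/ --outputCodec none
# (flags assumed from the EMR documentation; adjust for your cluster).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-distcp-output").getOrCreate()
sc = spark.sparkContext

# Step 2: the decompressed copy in HDFS is plain text, so Spark splits it
# into many partitions (roughly one per HDFS block) and processes it in parallel.
rdd = sc.textFile("hdfs:///data/huge-file/")
print(rdd.getNumPartitions())  # many partitions, not 1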

+1


