Spark / Hadoop throws an exception for large LZO files

I am running an EMR Spark job on some LZO-compressed log files stored in S3. Several log files are stored in one folder, for example:

    ...
    s3://mylogfiles/2014-08-11-00111.lzo
    s3://mylogfiles/2014-08-11-00112.lzo
    ...

In the Spark shell, I run a job that counts the lines in the files. If I count the lines for each file separately, there is no problem, e.g.:

    // Works fine
    ...
    sc.textFile("s3://mylogfiles/2014-08-11-00111.lzo").count()
    sc.textFile("s3://mylogfiles/2014-08-11-00112.lzo").count()
    ...

If I use a wildcard to load all the files in one line, I get one of two kinds of exceptions.

    // One-liner throws exceptions
    sc.textFile("s3://mylogfiles/*.lzo").count()

The exceptions are:

    java.lang.InternalError: lzo1x_decompress_safe returned: -6
        at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)

and

    java.io.IOException: Compressed length 1362309683 exceeds max block size 67108864 (probably corrupt file)
        at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:291)

It seems to me that the solution is hinted at by the message of the second exception, but I do not know how to proceed. Is there a limit on how large LZO files can be, or what is the problem here?

My question is: can I run Spark jobs that read all the LZO-compressed files in an S3 folder without getting I/O-related exceptions?

There are 66 files, approximately 200 MB each.

EDIT: The exception occurs only when running Spark with the Hadoop 2 core libraries (AMI 3.1.0). With the Hadoop 1 core libraries (AMI 2.4.5) everything works fine. Both cases were tested with Spark 1.0.1.

3 answers

I haven't come across this specific problem myself, but it looks like .textFile expects the files to be splittable, much like Cedrik's issue of Hive insisting on using CombineFileInputFormat.

You could either index your LZO files, or try using LzoTextInputFormat — I would be interested to know whether that works better on EMR:

    sc.newAPIHadoopFile("s3://mylogfiles/*.lzo",
        classOf[com.hadoop.mapreduce.LzoTextInputFormat],
        classOf[org.apache.hadoop.io.LongWritable],
        classOf[org.apache.hadoop.io.Text])
      .map(_._2.toString) // if you just want an RDD[String] without writing a new InputFormat
      .count

kgeyti's answer works fine, but:

LzoTextInputFormat introduces a performance hit, because it checks for a .index file for each LZO file. This can be especially painful with many LZO files on S3 (I experienced delays of several minutes, caused by thousands of requests to S3).

If you know that your LZO files are not splittable, a more performant solution is to create a custom, non-splittable input format:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    class NonSplittableTextInputFormat extends TextInputFormat {
      override def isSplitable(context: JobContext, file: Path): Boolean = false
    }

and read the files like this:

    context.newAPIHadoopFile("s3://mylogfiles/*.lzo",
        classOf[NonSplittableTextInputFormat],
        classOf[org.apache.hadoop.io.LongWritable],
        classOf[org.apache.hadoop.io.Text])
      .map(_._2.toString)

Yesterday we deployed Hive on an EMR cluster and had the same problem with some LZO files in S3 that were consumed without any problem by another, non-EMR cluster. After some digging in the logs, I noticed that the map tasks read the S3 files in 250 MB chunks, although the files are definitely not splittable.

It turned out that the parameter mapreduce.input.fileinputformat.split.maxsize was set to 250000000 (~250 MB). This causes LZO to open a stream from within a file, and ultimately to hit a "corrupt" LZO block.

I set mapreduce.input.fileinputformat.split.maxsize=2000000000, larger than the maximum size of our input files, and now everything works.
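In Spark the same setting can be applied through the Hadoop configuration carried by the context. A minimal sketch, assuming a spark-shell SparkContext named sc and the bucket from the question (the 2000000000 value is the one that worked for our ~200 MB files; pick anything larger than your biggest input file):

    // Raise the max split size above the largest input file, so that no
    // LZO file is ever split mid-stream (splits mid-file are what produce
    // the "probably corrupt file" error).
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 2000000000L)

    sc.textFile("s3://mylogfiles/*.lzo").count()

Note this only papers over the splitting; the input-format approaches in the other answers address it more directly.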

I'm not quite sure how exactly this relates to Spark, but changing the InputFormat might help, which seems to be the root of the problem in the first place, as mentioned in How Amazon EMR Hive differs from Apache Hive.


Source: https://habr.com/ru/post/1200045/
