I am running a Spark job on EMR over some LZO-compressed log files stored in S3. Several log files are stored in the same folder, for example:
    ...
    s3://mylogfiles/2014-08-11-00111.lzo
    s3://mylogfiles/2014-08-11-00112.lzo
    ...
In the Spark shell I run a job that counts the lines in the files. If I count the lines for each file separately, there is no problem, e.g.:
    // Works fine
    ...
    sc.textFile("s3://mylogfiles/2014-08-11-00111.lzo").count()
    sc.textFile("s3://mylogfiles/2014-08-11-00112.lzo").count()
    ...
If I use a wildcard to load all the files with a one-liner, I get two kinds of exceptions.
    // One-liner throws exceptions
    sc.textFile("s3://mylogfiles/*.lzo").count()
The exceptions are:
    java.lang.InternalError: lzo1x_decompress_safe returned: -6
        at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
and
    java.io.IOException: Compressed length 1362309683 exceeds max block size 67108864 (probably corrupt file)
        at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:291)
It seems to me that a solution is hinted at by the message of the last exception, but I do not know how to proceed. Is there a limit on how big LZO files are allowed to be, or what is the issue?
My question is this: can I run Spark queries that load all the LZO-compressed files in the S3 folder, without getting I/O-related exceptions?
There are 66 files, approximately 200 MB each.
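For what it is worth, looping over the files one by one and summing the counts does work for me, but it is exactly what I would like to avoid. A rough sketch of that workaround (the explicit file list is only illustrative):

    // Counting each file separately and summing the results works,
    // but I would prefer a single wildcard load instead of this loop.
    val files = Seq(
      "s3://mylogfiles/2014-08-11-00111.lzo",
      "s3://mylogfiles/2014-08-11-00112.lzo")
    val totalLines = files.map(path => sc.textFile(path).count()).sum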
EDIT: The exceptions only occur when running Spark with the Hadoop2 core libraries (AMI 3.1.0). With the Hadoop1 core libraries (AMI 2.4.5), everything works fine. Both cases were tested with Spark 1.0.1.
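In case it helps with reproducing this, a quick sanity check of which Hadoop core libraries the Spark shell actually picks up could look like the sketch below (not part of the failing job):

    // Print the Hadoop version string visible to the Spark shell.
    import org.apache.hadoop.util.VersionInfo
    println(VersionInfo.getVersion)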