A short answer to my first question: AWS does not perform automatic LZO indexing. I confirmed this in my own testing, and also read the same from Andrew @AWS on the AWS forum.
Here is how you can do the indexing yourself:
To index LZO files, you need a Jar built from Twitter's hadoop-lzo project. You will need to build the Jar somewhere and then upload it to Amazon S3 if you want to run the indexer directly on EMR.
As a side note, Cloudera has good instructions covering every stage of setting this up on your own cluster. I did this on my local cluster, which let me build the Jar and upload it to S3. You can probably find a pre-built Jar online if you don't want to build one yourself.
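As a rough sketch, building the Jar and uploading it to S3 might look something like the following. The repository URL, build command, and Jar version are assumptions based on the current state of the project; adjust them to whatever your build actually produces:

```shell
# Build the hadoop-lzo Jar locally (needs the lzo dev headers installed,
# e.g. liblzo2-dev on Debian/Ubuntu) -- build tooling is an assumption
git clone https://github.com/twitter/hadoop-lzo.git
cd hadoop-lzo
mvn clean package -DskipTests

# Upload the resulting Jar to your S3 bucket (bucket name is a placeholder)
aws s3 cp target/hadoop-lzo-0.4.17-SNAPSHOT.jar s3://<yourBucketName>/
```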
When writing output from your Hadoop job, make sure you use LzopCodec and not LzoCodec, otherwise the files cannot be indexed (at least in my experience). Sample Java code (the same idea carries over to the Streaming API):
import com.hadoop.compression.lzo.LzopCodec;

TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
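For the Streaming API, the equivalent is (as far as I know) to set the codec through configuration properties rather than in code. A sketch, using the old mapred-style property names; the jar path and input/output locations are placeholders:

```shell
# Hadoop Streaming job that writes LzopCodec-compressed .lzo output
hadoop jar /path/to/hadoop-streaming.jar \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
  -input  s3://<yourBucketName>/input \
  -output s3://<yourBucketName>/output/myLzoJobResults \
  -mapper cat \
  -reducer cat
```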
Once your hadoop-lzo Jar is on S3 and your Hadoop job outputs .lzo files, run the indexer over the output directory (the instructions below assume you already have an EMR job flow running):
elastic-mapreduce -j <existingJobId> \
  --jar s3n://<yourBucketName>/hadoop-lzo-0.4.17-SNAPSHOT.jar \
  --args com.hadoop.compression.lzo.DistributedLzoIndexer \
  --args s3://<yourBucketName>/output/myLzoJobResults \
  --step-name "Lzo file indexer Jar"
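If you would rather run the indexer from a machine that already has the Jar and a Hadoop client, rather than submitting an EMR step, I believe the same class can be invoked directly with hadoop jar (DistributedLzoIndexer runs the indexing as a MapReduce job):

```shell
# Index the .lzo files in the job's output directory as a MapReduce job
hadoop jar hadoop-lzo-0.4.17-SNAPSHOT.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  s3://<yourBucketName>/output/myLzoJobResults

# If it succeeds, each part-NNNNN.lzo should get a matching
# part-NNNNN.lzo.index file written next to it
```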
Then, when you consume this data in a later job, be sure to specify that the input is in LZO format, otherwise splitting will not happen. Sample Java code:
import com.hadoop.mapreduce.LzoTextInputFormat;

job.setInputFormatClass(LzoTextInputFormat.class);
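On the Streaming side, the hadoop-lzo project ships (if I recall correctly) an old-API wrapper, com.hadoop.mapred.DeprecatedLzoTextInputFormat, which you can pass via -inputformat so the indexed files get split. A sketch with placeholder paths:

```shell
# Streaming job reading indexed .lzo input; splitting only works
# with the LZO-aware input format
hadoop jar /path/to/hadoop-streaming.jar \
  -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
  -input  s3://<yourBucketName>/output/myLzoJobResults \
  -output s3://<yourBucketName>/output/nextStep \
  -mapper cat \
  -reducer cat
```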