A short answer to my first question: AWS does not perform automatic LZO indexing. I confirmed this in my own testing, and also read the same from Andrew @AWS on the AWS forum.
Here is how you can do the indexing yourself:
To index LZO files, you need a Jar built from Twitter's hadoop-lzo project. You will need to build the Jar somewhere and then upload it to Amazon S3 if you want to run the indexer directly on EMR.
As a side note, Cloudera has good instructions covering every stage of setting this up on your own cluster. I did this on my local cluster, which let me build the Jar and upload it to S3. You can probably find a pre-built Jar online if you don't want to build one yourself.
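As a rough sketch, building the Jar and uploading it to S3 might look something like the following. The repository URL, build command, and Jar version are assumptions based on the current state of the project; adjust them to whatever your build actually produces:

```shell
# Build the hadoop-lzo Jar locally (needs the lzo dev headers installed,
# e.g. liblzo2-dev on Debian/Ubuntu) -- build tooling is an assumption
git clone https://github.com/twitter/hadoop-lzo.git
cd hadoop-lzo
mvn clean package -DskipTests

# Upload the resulting Jar to your S3 bucket (bucket name is a placeholder)
aws s3 cp target/hadoop-lzo-0.4.17-SNAPSHOT.jar s3://<yourBucketName>/
```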
When writing output from your Hadoop job, make sure you use LzopCodec and not LzoCodec, otherwise the files cannot be indexed (at least in my experience). Sample Java code (the same idea carries over to the Streaming API):
import com.hadoop.compression.lzo.LzopCodec;

TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
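For the Streaming API, the equivalent is (as far as I know) to set the codec through configuration properties rather than in code. A sketch, using the old mapred-style property names; the jar path and input/output locations are placeholders:

```shell
# Hadoop Streaming job that writes LzopCodec-compressed .lzo output
hadoop jar /path/to/hadoop-streaming.jar \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
  -input  s3://<yourBucketName>/input \
  -output s3://<yourBucketName>/output/myLzoJobResults \
  -mapper cat \
  -reducer cat
```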
Once your hadoop-lzo Jar is on S3 and your Hadoop job outputs .lzo files, run the indexer over the output directory (the instructions below assume you already have an EMR job flow running):
elastic-mapreduce -j <existingJobId> \
  --jar s3n://<yourBucketName>/hadoop-lzo-0.4.17-SNAPSHOT.jar \
  --args com.hadoop.compression.lzo.DistributedLzoIndexer \
  --args s3://<yourBucketName>/output/myLzoJobResults \
  --step-name "Lzo file indexer Jar"
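If you would rather run the indexer from a machine that already has the Jar and a Hadoop client, rather than submitting an EMR step, I believe the same class can be invoked directly with hadoop jar (DistributedLzoIndexer runs the indexing as a MapReduce job):

```shell
# Index the .lzo files in the job's output directory as a MapReduce job
hadoop jar hadoop-lzo-0.4.17-SNAPSHOT.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  s3://<yourBucketName>/output/myLzoJobResults

# If it succeeds, each part-NNNNN.lzo should get a matching
# part-NNNNN.lzo.index file written next to it
```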
Then, when you consume this data in a later job, be sure to specify that the input is in LZO format, otherwise splitting will not happen. Sample Java code:
import com.hadoop.mapreduce.LzoTextInputFormat;

job.setInputFormatClass(LzoTextInputFormat.class);
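On the Streaming side, the hadoop-lzo project ships (if I recall correctly) an old-API wrapper, com.hadoop.mapred.DeprecatedLzoTextInputFormat, which you can pass via -inputformat so the indexed files get split. A sketch with placeholder paths:

```shell
# Streaming job reading indexed .lzo input; splitting only works
# with the LZO-aware input format
hadoop jar /path/to/hadoop-streaming.jar \
  -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
  -input  s3://<yourBucketName>/output/myLzoJobResults \
  -output s3://<yourBucketName>/output/nextStep \
  -mapper cat \
  -reducer cat
```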