See this link: How to process files, one per map?
- Upload your data to an S3 bucket
- Generate a file containing the full s3n:// path to each file
- Write a mapper script that:
  - Reads 'mapred_work_output_dir' from the environment (*)
  - Performs the XSLT transformation based on the file name, saving the result to the output directory
- Write a reducer that does nothing
- Upload the mapper / reducer scripts to the S3 bucket
- Script the AWS EMR job
(*) workconf . . .
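
For what it's worth, here is a rough sketch of the mapper and the do-nothing reducer from the steps above, in Python. It is not the exact script I ran. Hadoop Streaming exposes job configuration values as environment variables with the dots replaced by underscores, which is where 'mapred_work_output_dir' comes from. Everything else is an assumption for illustration: the function names, the '.out' suffix, and the 'fetch', 'transform' and 'store' callables, which stand in for however you read from S3, apply your stylesheet, and write to the output directory.

```python
#!/usr/bin/env python
"""Sketch of a streaming mapper: each input line names one s3n:// file."""
import posixpath


def output_uri(input_uri, output_dir):
    """Destination for one input file: same base name, '.out' suffix (assumed)."""
    name = posixpath.basename(input_uri.rstrip("/"))
    stem, _ext = posixpath.splitext(name)
    return posixpath.join(output_dir, stem + ".out")


def run_mapper(lines, environ, fetch, transform, store):
    """Convert every file listed in `lines` and return the destinations written.

    fetch(uri) -> bytes, transform(bytes) -> bytes and store(uri, bytes)
    are hypothetical hooks supplied by the caller, not a library API.
    """
    # Set by Hadoop Streaming from the job configuration (dots -> underscores).
    output_dir = environ["mapred_work_output_dir"]
    written = []
    for line in lines:
        # Take the last tab-separated field in case a key precedes the path.
        uri = line.strip().split("\t")[-1]
        if not uri:
            continue  # skip blank lines in the file list
        dest = output_uri(uri, output_dir)
        store(dest, transform(fetch(uri)))
        written.append(dest)
    return written


def identity_reducer(lines, out):
    """Reducer that does nothing: copy each input line through unchanged."""
    for line in lines:
        out.write(line)
```

In a real job you would call run_mapper(sys.stdin, os.environ, ...) with your own three hooks, and identity_reducer(sys.stdin, sys.stdout) in the reducer script.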