Recently I ran into the same problem. Your answer can be simplified: you don't need to install Jython or create a cache directory at all. You do need to include the Jython jar in an EMR bootstrap script (or do something similar). I wrote an EMR bootstrap script with the lines below. It could be simplified further by not using s3cmd at all and instead using your job flow (to place the files in a particular directory). Getting the UDF via s3cmd is definitely awkward, but I was not able to register a UDF file directly from S3 when using the EMR version of Pig.
If you use CharStream, you must include that jar in the piglib path as well. Depending on how you structure things, you can pass these bootstrap scripts as arguments to your job flow; EMR supports this through its elastic-mapreduce Ruby client. A simple option is to put the bootstrap scripts in S3.
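For example, here is a hedged sketch of creating a job flow with the elastic-mapreduce Ruby client; the bucket and script names are placeholders of my own, not anything from the original setup:

./elastic-mapreduce --create --alive --name "pig-jython-udfs" \
  --ami-version 2.0 \
  --bootstrap-action s3://<your-bucket>/bootstrap/seed-s3cmd.sh \
  --bootstrap-action s3://<your-bucket>/bootstrap/stage-jython.sh \
  --pig-interactive

The --bootstrap-action arguments run in the order given, which is what lets the s3cmd seeding script below execute before the jar staging script.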
If you use s3cmd in that bootstrap script, you need another bootstrap script that does something like the following, and it must run before the other one in the bootstrap order. I am moving away from s3cmd, but for my successful attempt s3cmd did the trick. As a bonus, the s3cmd executable is already installed in Amazon's Pig image (for example, AMI version 2.0 with Hadoop 0.20.205).
Script #1 (s3cmd seeding)
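A minimal sketch of what this seeding script might look like, assuming all it needs to do is write an s3cmd configuration file with your credentials (the file contents and key names below are placeholders, not my exact script):

#!/bin/bash
set -e

# Seed a minimal s3cmd configuration so the next bootstrap script can fetch
# objects from S3. Replace the placeholders with real credentials.
cat > /home/hadoop/.s3cfg <<EOF
[default]
access_key = <YOUR_AWS_ACCESS_KEY>
secret_key = <YOUR_AWS_SECRET_KEY>
EOF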
Script #2 (staging the jython jars)
#!/bin/bash
set -e

s3cmd get <jython.jar>
# Very useful for extra libraries not available in the jython jar. I got these
# libraries from the jython site and created a jar archive.
s3cmd get <jython_extra_libs.jar>
s3cmd get <UDF>

PIG_LIB_PATH=/home/hadoop/piglibs
mkdir -p $PIG_LIB_PATH

mv <jython.jar> $PIG_LIB_PATH
mv <jython_extra_libs.jar> $PIG_LIB_PATH
mv <UDF> $PIG_LIB_PATH

# Change the hadoop classpath as well.
echo "HADOOP_CLASSPATH=$PIG_LIB_PATH/<jython.jar>:$PIG_LIB_PATH/<jython_extra_libs.jar>" >> /home/hadoop/conf/hadoop-user-env.sh
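Once both bootstrap actions have run, the UDF can be registered in the Pig script from the local path instead of from S3, for example with REGISTER '/home/hadoop/piglibs/<UDF>' USING jython AS myfuncs; (myfuncs is just a hypothetical namespace name).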