How do you use Python UDF with Pig in Elastic MapReduce?

I really want to use Python UDFs in Pig on our AWS Elastic MapReduce cluster, but I can't get it to work. No matter what I try, my Pig job fails with the following exception:

ERROR 2998: Unhandled internal error. org/python/core/PyException

java.lang.NoClassDefFoundError: org/python/core/PyException
    at org.apache.pig.scripting.jython.JythonScriptEngine.registerFunctions(JythonScriptEngine.java:127)
    at org.apache.pig.PigServer.registerCode(PigServer.java:568)
    at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:421)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:419)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
    at org.apache.pig.Main.run(Main.java:437)
    at org.apache.pig.Main.main(Main.java:111)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException: org.python.core.PyException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 14 more

What do I need to do to use Python UDFs with Pig in Elastic MapReduce?

4 answers

After a few twists and turns, I found that, at least on Elastic MapReduce's Hadoop implementation, Pig seems to ignore the CLASSPATH environment variable. Instead, I found that I could control the class path using the HADOOP_CLASSPATH variable.

Once I had figured that out, it was fairly easy to get set up to use Python UDFs (the steps are collected into a single shell sketch after this list):

  • Install jython
    • sudo apt-get install jython -y -qq
  • Set the environment variable HADOOP_CLASSPATH.
    • export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar
      • jython.jar ensures that Hadoop can find the PyException class
      • antlr-runtime-3.2.jar ensures that Hadoop can find the CharStream class
  • Create a cache directory for Jython (this is documented in the Jython FAQ)
    • sudo mkdir /usr/share/java/cachedir/
    • sudo chmod a+rw /usr/share/java/cachedir
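
For convenience, here is the same sequence as a minimal shell sketch, run on the master node in the shell you launch Pig from; the jar locations are the ones from my setup and may differ on other images:

# Install Jython so that org.python.core.PyException can be found
sudo apt-get install jython -y -qq

# Point Hadoop (and therefore Pig) at the Jython and ANTLR runtime jars
export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar

# Create a writable package cache directory for Jython (see the Jython FAQ)
sudo mkdir -p /usr/share/java/cachedir/
sudo chmod a+rw /usr/share/java/cachedir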

I should point out that this seems to directly contradict other advice I found while searching for solutions to this problem:

  • Setting the CLASSPATH and PIG_CLASSPATH environment variables does not seem to do anything.
  • The .py file containing the UDFs does not need to be included in the HADOOP_CLASSPATH environment variable.
  • The path to the .py file used in Pig's register statement can be relative or absolute; it does not seem to matter (see the example right after this list).
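
To illustrate that last point, both of these forms of the register statement are expected to behave the same way; the file name here is hypothetical:

register udfs/myfuncs.py using jython as myfuncs;
register /home/hadoop/udfs/myfuncs.py using jython as myfuncs;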

Hmm... to clarify some of what I just read here: at this point, using a Python UDF in Pig running on EMR, with the UDF stored on S3, is as simple as this line in your Pig script:

REGISTER 's3://path/to/bucket/udfs.py' using jython as mynamespace;

That is, no classpath modifications are required. I am using this in production right now, though with the caveat that I am not pulling in any additional Python modules in my UDF. I think that may affect what you need to do to get it working.
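
As a usage sketch, a complete script might look like the following; the input path, field, and function name are hypothetical and not part of the original answer:

REGISTER 's3://path/to/bucket/udfs.py' using jython as mynamespace;
raw = LOAD 's3://path/to/bucket/input' AS (line:chararray);
cleaned = FOREACH raw GENERATE mynamespace.normalize(line);
STORE cleaned INTO 's3://path/to/bucket/output';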


I ran into the same problem recently. Your answer can be simplified: you don't need to install Jython or create a cache directory at all. You do need to include the Jython jar in an EMR bootstrap script (or do something similar). I wrote an EMR bootstrap script with the lines below. This could be simplified further by not using s3cmd at all and instead using your job flow (to place the files in a particular directory). Getting the UDF via s3cmd is definitely awkward; however, I was not able to register the UDF file directly from S3 with the version of Pig on EMR.

If you use CharStream, you have to include that jar in the piglib path as well. Depending on the framework you use, you can pass these bootstrap scripts as parameters to your job flow; EMR supports this through its elastic-mapreduce Ruby client (there is a sketch of such an invocation after Script #2 below). A simple option is to place the bootstrap scripts on S3.

If you use s3cmd in a bootstrap script, you need another bootstrap script that does something like the one below. It should be placed ahead of the other script in the bootstrap order. I am moving away from using s3cmd, but for my successful attempt, s3cmd did the trick. Also, the s3cmd executable is already installed on Amazon's Pig image (e.g., AMI version 2.0 with Hadoop 0.20.205).

Script #1 (seeding the s3cmd configuration)

#!/bin/bash
cat <<-OUTPUT > /home/hadoop/.s3cfg
[default]
access_key = YOUR KEY
bucket_location = US
cloudfront_host = cloudfront.amazonaws.com
cloudfront_resource = /2010-07-15/distribution
default_mime_type = binary/octet-stream
delete_removed = False
dry_run = False
encoding = UTF-8
encrypt = False
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/local/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase = YOUR PASSPHRASE
guess_mime_type = True
host_base = s3.amazonaws.com
host_bucket = %(bucket)s.s3.amazonaws.com
human_readable_sizes = False
list_md5 = False
log_target_prefix =
preserve_attrs = True
progress_meter = True
proxy_host =
proxy_port = 0
recursive = False
recv_chunk = 4096
reduced_redundancy = False
secret_key = YOUR SECRET
send_chunk = 4096
simpledb_host = sdb.amazonaws.com
skip_existing = False
socket_timeout = 10
urlencoding_mode = normal
use_https = False
verbosity = WARNING
OUTPUT

Script #2 (fetching and placing the Jython jars and the UDF)

#!/bin/bash
set -e

s3cmd get <jython.jar>
# Very useful for extra libraries not available in the jython jar. I got these libraries from the
# jython site and created a jar archive.
s3cmd get <jython_extra_libs.jar>
s3cmd get <UDF>

PIG_LIB_PATH=/home/hadoop/piglibs

mkdir -p $PIG_LIB_PATH

mv <jython.jar> $PIG_LIB_PATH
mv <jython_extra_libs.jar> $PIG_LIB_PATH
mv <UDF> $PIG_LIB_PATH

# Change hadoop classpath as well.
echo "HADOOP_CLASSPATH=$PIG_LIB_PATH/<jython.jar>:$PIG_LIB_PATH/<jython_extra_libs.jar>" >> /home/hadoop/conf/hadoop-user-env.sh
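
For completeness, here is a rough sketch of launching such a job flow with the elastic-mapreduce Ruby client, passing both bootstrap scripts in order; the bucket paths are placeholders and the exact flags should be checked against your client version:

# Hypothetical invocation: script #1 (s3cmd config) must run before script #2 (jar fetch)
elastic-mapreduce --create --name "pig-python-udf" \
  --ami-version 2.0 \
  --bootstrap-action s3://your-bucket/bootstrap/script1-s3cmd.sh \
  --bootstrap-action s3://your-bucket/bootstrap/script2-jython.sh \
  --pig-script --args s3://your-bucket/pig/script.pig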

As of today, using Pig 0.9.1 on EMR, I have found that the following is sufficient:

 env HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/jython.jar pig -f script.pig 

where script.pig registers the Python script, but not jython.jar:

 register Pig-UDFs/udfs.py using jython as mynamespace; 
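
For reference, udfs.py just needs ordinary Python functions annotated with Pig's @outputSchema decorator; the function below is a hypothetical illustration, not part of the original answer:

# Pig-UDFs/udfs.py
@outputSchema("word:chararray")
def to_upper(word):
    # Pig hands chararray values to Jython as strings; pass nulls through.
    if word is None:
        return None
    return word.upper()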

Source: https://habr.com/ru/post/1396710/

