Pig streaming with a Python script that imports modules

Working in pigtmp:

```
$ pig --version
Apache Pig version 0.8.1-cdh3u1 (rexported)
compiled Jul 18, 2011 08:29:40
```

I have a Python script (CPython) that imports another script; both are very simple in my example:

DATA

```
example$ hadoop fs -cat /user/pavel/trivial.log
1	one
2	two
3	three
```

EXAMPLE WITHOUT AN IMPORT - works fine

```
example$ pig -f trivial_stream.pig
(1,1,one)
()
(1,2,two)
()
(1,3,three)
()
```

where 1) trivial_stream.pig:

```pig
DEFINE test_stream `test_stream.py` SHIP('test_stream.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;
```

2) test_stream.py

```python
#!/usr/bin/env python
import sys
import string

for line in sys.stdin:
    if len(line) == 0:
        continue
    new_line = line
    print "%d\t%s" % (1, new_line)
```

So, essentially, I just tag every line with a single key, nothing special.
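The per-line logic above can be checked without Pig or Hadoop at all. A minimal sketch, with the loop body factored into a function (the function name `tag_line` is mine, not part of the original script):

```python
# The per-line transform of test_stream.py in isolation:
# tag each input line with the constant key 1.
def tag_line(line):
    return "%d\t%s" % (1, line.rstrip("\n"))

for sample in ["1\tone", "2\ttwo", "3\tthree"]:
    print(tag_line(sample))  # e.g. "1\t1\tone"
```

The extra `()` tuples in the DUMP output come from the blank lines the original script emits, since `print` adds a newline on top of the one already in `line`.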

EXAMPLE WITH AN IMPORT - blows up! Now I would like to add a line that comes from a Python module imported by test_stream.py, located in the same directory. I have tried shipping the imported module in different ways, but I keep getting the same error (see below).

1) trivial_stream.pig:

```pig
DEFINE test_stream `test_stream.py` SHIP('test_stream.py', 'test_import.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;
```

2) test_stream.py

```python
#!/usr/bin/env python
import sys
import string
import test_import

for line in sys.stdin:
    if len(line) == 0:
        continue
    new_line = "%s-%s" % (line.strip(), test_import.getTestLine())
    print "%d\t%s" % (1, new_line)
```

3) test_import.py

```python
def getTestLine():
    return "test line"
```

Now:

```
example$ pig -f trivial_stream.pig
```

Backend error message

```
org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:265)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.cleanup(PigMapBase.java:103)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
```

Pig stack trace

```
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
    at org.apache.pig.PigServer.openIterator(PigServer.java:753)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:615)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
    at org.apache.pig.Main.run(Main.java:396)
    at org.apache.pig.Main.main(Main.java:107)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:337)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
    at org.apache.pig.PigServer.storeEx(PigServer.java:885)
    at org.apache.pig.PigServer.store(PigServer.java:827)
    at org.apache.pig.PigServer.openIterator(PigServer.java:739)
    ... 7 more
```

Thank you very much for your help! -Pavel

2 answers

The correct answer is in the comment above:

Dependencies are not shipped. If you want your Python app to work with Pig, you need to tar it (don't forget the `__init__.py` files!) and then include the .tar file in Pig's SHIP statement. The first thing your script does is untar the app. There may be issues with paths, so I would suggest the following even before the tar extraction: `sys.path.insert(0, os.getcwd())`.
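A minimal, self-contained sketch of that approach. The archive name `deps.tar` stands in for whatever .tar file you would list in the SHIP clause, and the scratch directory simulates the Hadoop task's working directory:

```python
# Demo of "tar your dependencies, untar them at startup" in isolation.
import os
import sys
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
os.chdir(workdir)

# Simulate the shipped archive: a tarball containing test_import.py.
with open("test_import.py", "w") as f:
    f.write("def getTestLine():\n    return 'test line'\n")
with tarfile.open("deps.tar", "w") as tar:
    tar.add("test_import.py")
os.remove("test_import.py")

# --- what test_stream.py would do first ---
sys.path.insert(0, os.getcwd())  # make the task's cwd importable
with tarfile.open("deps.tar") as tar:
    tar.extractall(".")          # unpack the shipped dependencies

import test_import
print(test_import.getTestLine())  # -> test line
```

In the real job, only the `sys.path.insert` and `extractall` steps belong in test_stream.py; the archive itself is built once on the client and named in `SHIP(...)`.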


You need to add the current directory to `sys.path` in test_stream.py:

```python
#!/usr/bin/env python
import sys
sys.path.append(".")
```

So the SHIP statement you have does ship the Python module; you just need to tell Python where to look for it.
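The fix can be demonstrated in an isolated sketch: create a module in a scratch directory, chdir there (which is roughly what the task runner does with shipped files), and show that appending `"."` to `sys.path` makes it importable. The directory and file names are illustrative:

```python
# Demo: a shipped module in the task's working directory becomes
# importable once "." is appended to sys.path.
import os
import sys
import tempfile

task_dir = tempfile.mkdtemp()  # stands in for the Hadoop task's cwd
with open(os.path.join(task_dir, "test_import.py"), "w") as f:
    f.write("def getTestLine():\n    return 'test line'\n")

os.chdir(task_dir)
sys.path.append(".")  # the one-line fix for test_stream.py

import test_import
print(test_import.getTestLine())  # -> test line
```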


Source: https://habr.com/ru/post/1382582/

