Running Python on Hadoop

Question

I am trying to run a very simple Python script through Hive and Hadoop.

This is my script:

    #!/usr/bin/env python
    import sys

    for line in sys.stdin:
        line = line.strip()
        nums = line.split()
        i = nums[0]
        print i

I want to run it against the following table:

    hive> select * from test;
    OK
    1	3
    2	2
    3	1
    Time taken: 0.071 seconds
    hive> desc test;
    OK
    col1	int
    col2	string
    Time taken: 0.215 seconds

I run:

    hive> select transform (col1, col2) using './proba.py' from test;

But always get something like:

    ...
    2011-11-18 12:23:32,646 Stage-1 map = 0%, reduce = 0%
    2011-11-18 12:23:58,792 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_201110270917_20215 with errors
    FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I have tried many different modifications to this procedure, but I constantly fail. :(

Am I doing something wrong or is there a problem with my hive / hadoop installation?

python hadoop hive

twowo Nov 18 '11 at 11:33

2 answers

Matthew Rathbone · Answer 1 · 2011-11-22T22:28:52+0000

A few things I would check if I were debugging this:

1) Is the Python file executable? (chmod +x file.py)

2) Make sure the Python file is in the same place on all machines. Probably better: put the file in HDFS, then you can use "using 'hdfs://path/to/file.py'" instead of the local path.

3) Take a look at your job on the Hadoop dashboard (http://master-node:9100). If you click on a failed task, it will give you the actual Java errors and stack traces, so you can see what actually went wrong during execution.

4) Make sure Python is installed on all the worker nodes! (I always forget about it.)
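Points 1 and 4 can be checked on a single machine before involving the cluster. The following is only a rough sketch in Python 3 (the script in the question is Python 2): `check_transform_script` and its `interpreter` parameter are made-up names for illustration, not part of Hive or Hadoop. It assumes Hive feeds rows to the transform script as tab-separated text on stdin, which is the default behaviour.

```python
import os
import subprocess


def check_transform_script(path, sample_rows, interpreter=None):
    """Run a transform script locally, roughly the way a map task would.

    path         -- location of the script (hypothetical, e.g. './proba.py')
    sample_rows  -- list of rows, each a list of strings
    interpreter  -- optional interpreter to run the script with; when None,
                    the script is executed directly, so its shebang line and
                    the Python found on PATH are exercised as well
    """
    # Point 1: the file must carry the executable bit.
    if not os.access(path, os.X_OK):
        raise RuntimeError("%s is not executable; try: chmod +x %s" % (path, path))

    # Assumption: rows arrive as tab-separated fields, one row per line.
    payload = "".join("\t".join(row) + "\n" for row in sample_rows)

    command = [path] if interpreter is None else [interpreter, path]
    result = subprocess.run(command, input=payload, capture_output=True, text=True)

    # A non-zero return code or anything on stderr is what would make the
    # corresponding Hadoop task fail.
    return result.returncode, result.stdout, result.stderr
```

Calling `check_transform_script('./proba.py', [['1', '3'], ['2', '2'], ['3', '1']])` on each node (or at least one worker) shows whether the script starts at all there. It does not prove the cluster job will succeed, since the task runs under a different user and working directory.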

Hope this helps ...

Dave Brondsema · Answer 2 · 2011-12-16T06:58:37+0000

Check hive.log and/or the log from the Hadoop job (job_201110270917_20215 in your example) for a more detailed error message.
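The logs are more useful if the script itself reports what it is doing. Below is a hedged sketch of a more defensive variant of the script from the question. It is not a confirmed fix for the failure above, and the names `first_column` and `main` are invented for illustration. It assumes the columns arrive tab-separated and that whatever the script writes to stderr ends up in the task's stderr log. It avoids the `print` statement, so it should run under both Python 2 and Python 3.

```python
import sys


def first_column(line):
    """Return the first tab-separated field, or None if it is empty."""
    fields = line.rstrip("\r\n").split("\t")
    return fields[0] if fields[0].strip() else None


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    """Copy the first column of every input row to stdout.

    Diagnostics go to stderr so they do not mix with the rows that are
    returned to Hive on stdout.
    """
    emitted = 0
    for lineno, line in enumerate(stdin, 1):
        value = first_column(line)
        if value is None:
            # The original script would raise IndexError on a blank line.
            stderr.write("line %d: empty first column, skipped\n" % lineno)
            continue
        stdout.write(value + "\n")
        emitted += 1
    stderr.write("done: %d rows emitted\n" % emitted)
    return emitted
```

To use this as the transform script, keep the `#!/usr/bin/env python` line at the top of the file and call `main()` at the bottom. Any uncaught exception still prints its traceback to stderr, which is the detailed error message to look for in the job log.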
