Python code works locally, but Hadoop Streaming produces an empty part-00000 file

On an Ubuntu virtual machine, I set up a single-node cluster following the Michael Noll tutorial, and that was my starting point for writing a Hadoop program.


My program is in Python and uses Hadoop Streaming.

I wrote a simple vector multiplication program where mapper.py accepts input files v1 and v2, each containing a vector in the form 12,33,10, and outputs the element-wise products. Then reducer.py returns the sum of the products, i.e.:

mapper : map(mult,v1,v2)

reducer : sum(p1,p2,p3,...,pn)
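As a minimal local sketch (plain Python, no Hadoop involved), the two steps above amount to:

```python
# Minimal local sketch of the map and reduce steps described above,
# using the example vectors from this question.
v1 = ("5", "12", "20")
v2 = ("14", "11", "3")

def mult(x, y):
    return int(x) * int(y)

products = list(map(mult, v1, v2))  # mapper step: element-wise products
total = sum(products)               # reducer step: sum of the products
print(products, total)              # [70, 132, 60] 262
```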

mapper.py:

    import sys

    def mult(x, y):
        return int(x) * int(y)

    # Input comes from STDIN (standard input).
    inputvec = tuple()
    for i in sys.stdin:
        i = i.strip()
        inputvec += (tuple(i.split(",")),)

    v1 = inputvec[0]
    v2 = inputvec[1]

    results = map(mult, v1, v2)

    # Simply printing the results variable would print the tuple. This
    # would be fine except that the STDIN of reducer.py takes all the
    # output as input, including brackets, which can be problematic.
    # Cleaning the output ready to be input for the reduce step:
    for o in results:
        print ' %s' % o,

reducer.py:

    import sys

    result = int()
    for a in sys.stdin:
        a = a.strip()
        a = a.split()
        for r in range(len(a)):
            result += int(a[r])

    print result

In the in subdirectory, I have v1 containing 5,12,20 and v2 containing 14,11,3.

Testing locally, everything works as expected:

    hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py
     70 132 60
    hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort
     70 132 60
    hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort | python ./reducer.py
    262

When I run it in Hadoop, it seems to complete successfully and raises no exceptions:

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper python /home/hduser/VectMult3/mapper.py -reducer python /home/hduser/VectMult3/reducer.py -input /home/hduser/VectMult3/in -output /home/hduser/VectMult3/out4
    Warning: $HADOOP_HOME is deprecated.

    packageJobJar: [/app/hadoop/tmp/hadoop-unjar2168776605822419867/] [] /tmp/streamjob6920304075078514767.jar tmpDir=null
    12/11/18 21:20:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    12/11/18 21:20:09 WARN snappy.LoadSnappy: Snappy native library not loaded
    12/11/18 21:20:09 INFO mapred.FileInputFormat: Total input paths to process : 2
    12/11/18 21:20:09 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
    12/11/18 21:20:09 INFO streaming.StreamJob: Running job: job_201211181903_0009
    12/11/18 21:20:09 INFO streaming.StreamJob: To kill this job, run:
    12/11/18 21:20:09 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201211181903_0009
    12/11/18 21:20:09 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201211181903_0009
    12/11/18 21:20:10 INFO streaming.StreamJob:  map 0%  reduce 0%
    12/11/18 21:20:24 INFO streaming.StreamJob:  map 67%  reduce 0%
    12/11/18 21:20:33 INFO streaming.StreamJob:  map 100%  reduce 0%
    12/11/18 21:20:36 INFO streaming.StreamJob:  map 100%  reduce 22%
    12/11/18 21:20:45 INFO streaming.StreamJob:  map 100%  reduce 100%
    12/11/18 21:20:51 INFO streaming.StreamJob: Job complete: job_201211181903_0009
    12/11/18 21:20:51 INFO streaming.StreamJob: Output: /home/hduser/VectMult3/out4

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /home/hduser/VectMult3/out4/part-00000
    Warning: $HADOOP_HOME is deprecated.

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /home/hduser/VectMult3/out4/
    Warning: $HADOOP_HOME is deprecated.

    Found 3 items
    -rw-r--r--   1 hduser supergroup  0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_SUCCESS
    drwxr-xr-x   - hduser supergroup  0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_logs
    -rw-r--r--   1 hduser supergroup  0 2012-11-18 22:05 /home/hduser/VectMult3/out4/part-00000

But when I check the output, all I find is a 0-byte empty file.

I can’t understand what went wrong. Can anyone help?


Edit: Reply to @DiJuMx

One way to fix this would be to have the map output to a temporary file, and then have the reduce work on the temporary file.

Not sure if Hadoop allows this? Hopefully someone who knows better can correct me on this.

Before trying to do this, try writing a simpler version that just passes the data straight through with no processing.

I thought that was a good idea, just to verify that the data is being passed through correctly. For this, I used the following:

Both mapper.py and reducer.py:

    import sys

    for i in sys.stdin:
        print i,

The output should be exactly what came in. It still produces an empty file.

Alternatively, edit the existing code in the reducer to output an error message to the output file if the input was empty

mapper.py

    import sys

    for i in sys.stdin:
        print "mapped",

    print "mapper",

reducer.py

    import sys

    for i in sys.stdin:
        print "reduced",

    print "reducer",

If input is being accepted, the final output should be reduced. Either way, it should at least output reducer. The actual output is an empty file.
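Simulated locally (Python 3 syntax; the function names are made up just for this sketch), the diagnostic pair behaves like this:

```python
# Local simulation of the diagnostic mapper/reducer pair above.
def diag_mapper(lines):
    out = ["mapped" for _ in lines]  # one "mapped" per input line
    out.append("mapper")             # always emitted, even with no input
    return out

def diag_reducer(lines):
    out = ["reduced" for _ in lines]
    out.append("reducer")
    return out

# With two input lines, the reducer sees three mapper lines (two "mapped"
# plus the trailing "mapper") and should still always emit "reducer" last.
print(" ".join(diag_reducer(diag_mapper(["5,12,20", "14,11,3"]))))
```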

6 answers

I have no experience with Hadoop (or Python, for that matter), but from what I can see, you pointed the output at /home/hduser/VectMult3/out4, yet were expecting it in /home/hduser/VectMult3/out/part-00000.

Have you checked the out4 directory and its contents?


Assuming your code is completely correct (which, based on the first part of your question, I assume it is), the problem can be narrowed down to the environment you're running in. In this case, I'd say the output from the mapper is NOT being piped into the reducer (as it is when you run the commands manually).

One way to fix this would be to have the map output to a temporary file, and then have the reduce work on the temporary file.

Before trying to do this, try writing a simpler version that just passes the data straight through with no processing. If you still get nothing, try the temporary file.

Alternatively, edit the existing code in the reducer to output an error message to the output file if the input was empty


I know this question is pretty old, but I'd like to help anyway. I see that in your reducer.py and mapper.py examples you just output a single value. I believe (I'm only just starting with Hadoop, but this has been my experience so far) that it requires a key-value pair, separated by a tab.

For example, your mapper's output might look like the following:
print "%s\t%s" % (str(random.randint(0,100000)), "mapped")

Without that, I'm not sure the reducer is able to output anything. If this still doesn't work, create a test case following the instructions on this blog: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ . That worked in my environment, so hopefully it will do the same for you. If that example does not work, it is most likely a problem with your Hadoop configuration.
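A hedged sketch of what such tab-separated mapper output could look like (Python 3 syntax; the key "product" and the helper name are made up for illustration, not part of the original code):

```python
# Sketch of a streaming-style mapper that emits key<TAB>value lines,
# as the answer suggests. "product" is an arbitrary constant key here.
def emit_pairs(lines):
    """Turn comma-separated input lines into key<TAB>value output lines."""
    pairs = []
    for line in lines:
        for field in line.strip().split(","):
            pairs.append("product\t%s" % field)
    return pairs

# In a real streaming mapper, the lines would come from sys.stdin:
for pair in emit_pairs(["70,132,60"]):
    print(pair)
```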


Have you tried replacing python mapper.py with "python mapper.py"? I suspect your mappers are running just python instead of python mapper.py, which may explain your empty output. Also, your mapper.py file should not be on HDFS; keep it somewhere local instead. Then ship it with the job's jar file using the -file <path_to_local_file> option on the command line:

    /bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -mapper "python mapper.py" -file <path_to_your_local_mapper.py> \
        -reducer "python reducer.py" -file <path_to_your_local_reducer.py> \
        -input /home/hduser/VectMult3/in -output <output_dir_name>

I had similar problems and found that making sure the following line was at the top of my Python scripts fixed it:

#!/usr/bin/python

Maybe give it a try and see if it helps?

P.S. Also, just looking at our own Python mappers and reducers, we use the print statement with no comma at the end of the line.


I had the same problem. mapper.py and reducer.py worked fine locally, but Hadoop Streaming returned an empty file without any errors!

I solved this problem with the Hadoop Streaming command below (note the format I used for -mapper and -reducer!):

    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-*streaming*.jar \
        -file /home/training/Desktop/mapper.py -mapper 'python mapper.py' \
        -file /home/training/Desktop/reducer.py -reducer 'python reducer.py' \
        -input /user/training/sales.csv -output /user/training/output18

I hope this helps others with the same problem.


Source: https://habr.com/ru/post/1446827/

