On an Ubuntu virtual machine, I set up a single-node cluster following the Michael Noll tutorial, and that was my starting point for writing a Hadoop program.
I also used this as a reference.
My program is in Python and uses Hadoop Streaming.
I wrote a simple vector multiplication program, where mapper.py accepts input files v1 and v2, each containing a vector in the form 12,33,10, and returns the elementwise products. reducer.py then returns the sum of those products, i.e.:
mapper: map(mult, v1, v2)
reducer: sum(p1, p2, p3, ..., pn)
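In plain Python the two stages together are just a dot product. A minimal sketch of what each stage should produce (the vector values here are illustrative, not my actual input):

```python
# Sketch of the two stages in plain Python (illustrative values).
v1 = [1, 2, 3]
v2 = [4, 5, 6]
products = [x * y for x, y in zip(v1, v2)]  # mapper step: elementwise products
total = sum(products)                       # reducer step: sum of products
print(products, total)                      # -> [4, 10, 18] 32
```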
mapper.py:
```python
import sys

def mult(x, y):
    return int(x) * int(y)

# read the two vectors from stdin and print their elementwise products
v1, v2 = (line.strip().split(',') for line in sys.stdin)
print ' '.join(str(p) for p in map(mult, v1, v2))
```
reducer.py:
```python
import sys

result = int()
for a in sys.stdin:
    a = a.strip()
    a = a.split()
    for r in range(len(a)):
        result += int(a[r])
print result
```
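To sanity-check the reducer logic on its own, the same summation can be exercised against an in-memory stream instead of real stdin (Python 3 syntax here, whereas the scripts above are Python 2; `reduce_stream` is just a test helper, not part of the job):

```python
import io

def reduce_stream(stream):
    # same summation as reducer.py: split each line on whitespace, sum the ints
    result = 0
    for line in stream:
        for token in line.strip().split():
            result += int(token)
    return result

print(reduce_stream(io.StringIO("70 132 60\n")))  # -> 262
```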
In the in subdirectory, I have v1 containing 5,12,20 and v2 containing 14,11,3.
Testing locally, everything works as expected:
```
hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py
70 132 60
hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort
70 132 60
hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort | python ./reducer.py
262
```
When I run it in Hadoop, it seems to complete successfully and doesn't raise any exceptions:
```
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper python /home/hduser/VectMult3/mapper.py -reducer python /home/hduser/VectMult3/reducer.py -input /home/hduser/VectMult3/in -output /home/hduser/VectMult3/out4
Warning: $HADOOP_HOME is deprecated.

packageJobJar: [/app/hadoop/tmp/hadoop-unjar2168776605822419867/] [] /tmp/streamjob6920304075078514767.jar tmpDir=null
12/11/18 21:20:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/11/18 21:20:09 WARN snappy.LoadSnappy: Snappy native library not loaded
12/11/18 21:20:09 INFO mapred.FileInputFormat: Total input paths to process : 2
12/11/18 21:20:09 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
12/11/18 21:20:09 INFO streaming.StreamJob: Running job: job_201211181903_0009
12/11/18 21:20:09 INFO streaming.StreamJob: To kill this job, run:
12/11/18 21:20:09 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201211181903_0009
12/11/18 21:20:09 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201211181903_0009
12/11/18 21:20:10 INFO streaming.StreamJob: map 0% reduce 0%
12/11/18 21:20:24 INFO streaming.StreamJob: map 67% reduce 0%
12/11/18 21:20:33 INFO streaming.StreamJob: map 100% reduce 0%
12/11/18 21:20:36 INFO streaming.StreamJob: map 100% reduce 22%
12/11/18 21:20:45 INFO streaming.StreamJob: map 100% reduce 100%
12/11/18 21:20:51 INFO streaming.StreamJob: Job complete: job_201211181903_0009
12/11/18 21:20:51 INFO streaming.StreamJob: Output: /home/hduser/VectMult3/out4

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /home/hduser/VectMult3/out4/part-00000
Warning: $HADOOP_HOME is deprecated.

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /home/hduser/VectMult3/out4/
Warning: $HADOOP_HOME is deprecated.

Found 3 items
-rw-r--r--   1 hduser supergroup          0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_SUCCESS
drwxr-xr-x   - hduser supergroup          0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_logs
-rw-r--r--   1 hduser supergroup          0 2012-11-18 22:05 /home/hduser/VectMult3/out4/part-00000
```
But when I check the output, all I find is an empty 0-byte file.
I can't understand what went wrong. Can anyone help?
Edit: Reply to @DiJuMx
> One way to fix this is to output to a temporary file from the mapper, and then use that temporary file in the reducer.
Not sure if Hadoop allows this? Hopefully someone who knows better can correct me on this.
> Before trying this, try writing a simpler version that just passes the data straight through without processing it.
I thought this was a good idea, just to verify that the data is being passed through correctly. For this, I used the following:
Both mapper.py and reducer.py:
```python
import sys

for i in sys.stdin:
    print i,
```
What comes out should be exactly what went in. It still produces an empty file.
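For reference, this is what I expect the pass-through pair to do; here is a local simulation of the `cat | mapper | sort | reducer` pipeline using in-memory streams (Python 3 syntax; `identity` is just a stand-in for the pass-through scripts above):

```python
import io

def identity(stream):
    # same logic as the pass-through mapper/reducer: echo every stdin line
    return [line for line in stream]

data = "5,12,20\n14,11,3\n"
mapped = identity(io.StringIO(data))                      # mapper stage
reduced = identity(io.StringIO("".join(sorted(mapped))))  # sort, then reducer stage
print("".join(reduced), end="")  # the same two lines, in sorted order
```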
> Alternatively, edit the existing code in the reducer to write an error message to the output file if the input was empty
mapper.py
```python
import sys

for i in sys.stdin:
    print "mapped",
print "mapper",
```
reducer.py
```python
import sys

for i in sys.stdin:
    print "reduced",
print "reducer",
```
If the input gets through, the final output should contain reduced. Either way, it should at least output reducer. The actual output is an empty file.