I wrote a simple k-means clustering code for Hadoop (two separate programs - a mapper and a reducer). The code works on a small dataset of 2D points on my local machine. It is written in Python, and I plan to use the Streaming API.
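To make the setup concrete, here is a stripped-down sketch of what such a streaming mapper and reducer might look like (the centers.txt filename, the comma-separated point format, and shipping the centers file to each task with -files are assumptions for illustration, not my exact code):

```python
#!/usr/bin/env python
# mapper.py -- sketch: assign each input point to its nearest current center.
# Assumes centers.txt (one "x,y" per line) is shipped to the task with -files,
# and that each stdin line is a point "x,y".
import sys
import math

def load_centers(path="centers.txt"):
    centers = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                x, y = map(float, line.split(","))
                centers.append((x, y))
    return centers

centers = load_centers()

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    px, py = map(float, line.split(","))
    # index of the nearest center by Euclidean distance
    nearest = min(range(len(centers)),
                  key=lambda i: math.hypot(px - centers[i][0], py - centers[i][1]))
    # emit: center_id <TAB> x,y
    print("%d\t%f,%f" % (nearest, px, py))
```

```python
#!/usr/bin/env python
# reducer.py -- sketch: average the points assigned to each center.
# Streaming sorts lines by key, so all points for one center arrive together.
# Emitting only "x,y" lets the reducer output double as the next centers file.
import sys

current_key = None
sx = sy = 0.0
n = 0

def emit(sx, sy, n):
    if n:
        print("%f,%f" % (sx / n, sy / n))

for line in sys.stdin:
    key, value = line.strip().split("\t")
    x, y = map(float, value.split(","))
    if current_key is not None and key != current_key:
        emit(sx, sy, n)
        sx = sy = 0.0
        n = 0
    current_key = key
    sx += x
    sy += y
    n += 1

emit(sx, sy, n)
```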
I would like suggestions on how best to run this program on Hadoop.
After each run of the mapper and reducer, new centers are generated. These centers become the input for the next iteration.
From what I can tell, each k-means iteration has to be a separate MapReduce job. And it looks like I will have to write another script (Python/Bash) to extract the new centers from HDFS after each reduce phase and pass them back to the mapper, roughly as sketched below.
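A rough sketch of that driver idea (the iteration count, HDFS paths, and streaming jar location below are made-up placeholders, and a real version would also need a convergence check):

```python
#!/usr/bin/env python
# driver.py -- sketch: run one streaming job per k-means iteration and pull the
# new centers back out of HDFS for the next round. All paths are placeholders,
# and an initial centers.txt is assumed to exist locally before the loop starts.
import shutil
import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # assumed location
INPUT = "/user/me/points"                                         # assumed HDFS input dir
MAX_ITERATIONS = 10

def run(cmd):
    subprocess.check_call(cmd)

for i in range(MAX_ITERATIONS):
    output = "/user/me/centers_%d" % i  # fresh output dir; Hadoop refuses to overwrite
    run(["hadoop", "jar", STREAMING_JAR,
         "-files", "mapper.py,reducer.py,centers.txt",
         "-mapper", "python mapper.py",
         "-reducer", "python reducer.py",
         "-input", INPUT,
         "-output", output])
    # pull the reducer output (the new centers) back to the local file system
    new_centers = "centers_iter_%d.txt" % i
    run(["hadoop", "fs", "-getmerge", output, new_centers])
    # TODO: stop early if new_centers barely differs from centers.txt
    shutil.copy(new_centers, "centers.txt")  # mapper always reads "centers.txt"
```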
Is there a simpler, less hacky way to do this? If the cluster uses a fair scheduler, will it take a very long time for this computation to complete?