Iterative MapReduce

I wrote a simple k-means clustering code for Hadoop (two separate programs: a mapper and a reducer). The code works on a small test set of 2D points in a local file. It is written in Python, and I plan to use the Streaming API.
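For reference, the per-iteration logic of such a mapper and reducer can be sketched as plain Python functions (in a real Streaming job these would read tab-separated lines from stdin; all names here are illustrative, not the author's actual code):

```python
import math

def nearest_center(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: math.dist(point, centers[i]))

def map_point(point, centers):
    """Mapper side: emit (center_index, point) for one input point."""
    return nearest_center(point, centers), point

def reduce_points(points):
    """Reducer side: the new center is the mean of the assigned points."""
    n = float(len(points))
    return tuple(sum(coord) / n for coord in zip(*points))
```

Hadoop Streaming groups the mapper's output by key, so each reducer call sees all points assigned to one center and emits that center's new position.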

I would like suggestions on how best to run this program on Hadoop.

After each run of the mapper and reducer, new cluster centers are produced. These centers are the input for the next iteration.

From what I see, each iteration of the algorithm has to be a separate MapReduce job. And it looks like I will have to write another script (Python/Bash) to extract the new centers from HDFS after each reduce phase and feed them back to the mapper.
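That driver can be a short script around the `hadoop` CLI. A minimal sketch, assuming hypothetical HDFS paths, file names, and jar location (shipping the current centers to every task with `-file`, then pulling the new ones back with `-getmerge`):

```python
import subprocess

STREAMING_JAR = "hadoop-streaming.jar"  # assumed location

def streaming_cmd(iteration, centers_file="centers.txt"):
    """Build the `hadoop jar` command line for one k-means iteration."""
    return [
        "hadoop", "jar", STREAMING_JAR,
        "-input", "/kmeans/points",
        "-output", "/kmeans/iter%d" % iteration,
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py",
        "-file", "reducer.py",
        # Ship the current centers to every map task:
        "-file", centers_file,
    ]

def run_iterations(n, centers_file="centers.txt"):
    for i in range(n):
        subprocess.check_call(streaming_cmd(i, centers_file))
        # Pull the new centers out of HDFS for the next round.
        subprocess.check_call(["hadoop", "fs", "-getmerge",
                               "/kmeans/iter%d" % i, centers_file])
```

Each iteration is still a separate job, but the loop, the job submission, and the HDFS round trip live in one place instead of being glued together by hand.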

Is there a simpler, less hacky way to do this? If the cluster uses the fair scheduler, will it take very long for this computation to complete?

4 answers

It feels a bit awkward to answer my own question. I used Pig 0.9 (not yet released, but available in trunk). It supports modularity and flow control by letting you embed Pig statements inside scripting languages such as Python.

So I wrote a main Python script with a loop, inside which my Pig scripts were invoked. The Pig scripts in turn call UDFs. In all, I had to write three different programs, but it worked out fine.
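Whatever drives the loop, it needs a stopping rule. A minimal convergence check, stopping once no center moves more than a threshold between iterations (the tolerance value here is an assumption):

```python
import math

def converged(old_centers, new_centers, tol=1e-4):
    """True when no center moved more than tol between iterations."""
    return all(math.dist(a, b) <= tol
               for a, b in zip(old_centers, new_centers))
```

Checking movement of the centers rather than running a fixed number of iterations avoids both wasted jobs after convergence and stopping too early.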

- http://www.mail-archive.com/user@pig.apache.org/msg00672.html

The UDFs can be written in Python as well, so the driver, the Pig scripts, and the UDFs all stay in one language.


Another option is to drive the jobs from an external script with a `while` loop that, after each reduce phase, checks whether the centers have changed and stops once they converge.


Here are some ways to do this: github.com/bwhite/hadoop_vision/tree/master/kmeans

Also check out Hadoopy (it supports Oozie): http://bwhite.github.com/hadoopy/


Source: https://habr.com/ru/post/1782247/

