Hadoop map order / priority

I have ~5000 entries in my Hadoop input file, but I know in advance that some of the lines will take much longer to process than others (at the map stage). This is mainly because each task needs to download a file from Amazon S3, and the file size varies between tasks.

I want to make sure that the biggest map tasks are processed first, so that all my Hadoop nodes finish their work at about the same time.

Is there a way to do this with Hadoop, or do I need to handle it myself? (I'm new to Hadoop.)

Thanks!

1 answer

Well, if you implement your own InputFormat (its getSplits() method contains the logic for creating the splits), then in theory you could achieve what you want.

BUT, you have to be especially careful, because the order in which the InputFormat returns the splits is not the order in which Hadoop processes them. Inside the JobClient there is a piece of code that re-sorts the splits:

    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(array, new NewSplitComparator());

which makes this more complicated. But you can implement your own InputFormat plus your own InputSplit and make InputSplit#getLength() return a value based on the expected execution time: since the framework sorts the splits by that length with the biggest first, the most expensive tasks will be scheduled first.
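
For concreteness, here is a minimal sketch of that idea against the new MapReduce API (org.apache.hadoop.mapreduce). The names S3ObjectSplit and expectedCost are made up for illustration, and the surrounding InputFormat and RecordReader are omitted; the only point is that getLength() returns the expected cost of the task (for example, the size of the S3 object), which is exactly the value the JobClient's sort uses:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputSplit;

    // Hypothetical split: one split per S3 object. getLength() reports
    // the expected cost (here: the object's size in bytes), which is
    // the value the JobClient sorts on, biggest first.
    public class S3ObjectSplit extends InputSplit implements Writable {

        private String s3Key;       // the S3 object this split covers
        private long expectedCost;  // e.g. the object's size in bytes

        public S3ObjectSplit() {    // no-arg constructor for deserialization
        }

        public S3ObjectSplit(String s3Key, long expectedCost) {
            this.s3Key = s3Key;
            this.expectedCost = expectedCost;
        }

        public String getS3Key() {
            return s3Key;
        }

        @Override
        public long getLength() {
            // Not a byte offset: this is the "size" the framework uses
            // to order the splits, so return the expected cost instead.
            return expectedCost;
        }

        @Override
        public String[] getLocations() {
            // The data lives in S3, not HDFS, so no locality hints.
            return new String[0];
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(s3Key);
            out.writeLong(expectedCost);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            s3Key = in.readUTF();
            expectedCost = in.readLong();
        }
    }

Your InputFormat's getSplits() would then build one such split per S3 object (looking the sizes up via an S3 listing, for example), and your RecordReader would download and process the object named in the split. Hadoop's own sorting then takes care of running the most expensive tasks first.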

