A “simple” Java program and a Hadoop-based MapReduce implementation are very different animals and are hard to compare. Hadoop does not just parallelize a bit of your program; the program has to be written in a completely different form from top to bottom.
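To make "a completely different form" concrete, here is a minimal sketch of my own (not part of the original answer) contrasting a plain in-memory word count with the mapper/reducer pair the equivalent Hadoop job needs. The class names are illustrative, and the job driver, input/output formats, and cluster configuration are omitted.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountContrast {

    // "Simple" Java: one loop over the lines, an in-memory map, done.
    public static Map<String, Integer> countLocally(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // The same logic recast as a Hadoop mapper: it emits (word, 1) pairs
    // that the framework serializes, spills to disk, shuffles and sorts.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // ...plus a reducer that sums the shuffled counts per word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

Even for this toy task you also need a `Job` driver, input in HDFS, and a cluster to run it on, which is the point: none of the single-machine code carries over unchanged.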
Hadoop has overhead: just the overhead of starting up the job and spinning up workers like mappers and reducers. It adds a lot more time serializing/deserializing data, writing it locally, and transferring it to and from HDFS.

A Hadoop-based implementation will always consume more resources. So it is something to avoid unless you cannot avoid it. If you can run the computation non-distributed on one machine, the simplest practical advice is: don't distribute. Save yourself the trouble.
In the case of the Mahout recommenders, I can tell you that, very roughly, the Hadoop job incurs 2-4x more computation than a non-distributed implementation on the same data. Obviously that depends a lot on the algorithm and the choice of algorithm tuning options. But to give you a number: I wouldn't bother with a Hadoop cluster of fewer than 4 machines.
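For reference, the non-distributed side of that comparison is only a handful of lines against Mahout's Taste API. This is a sketch under assumptions: the `ratings.csv` path, the neighborhood size of 10, and the user ID 123 are placeholders, not values from the original answer.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class LocalRecommenderSketch {
    public static void main(String[] args) throws Exception {
        // Load user,item,preference triples from a local CSV file (placeholder path).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Classic user-based collaborative filtering, all in one JVM.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-5 recommendations for user 123 (illustrative ID).
        List<RecommendedItem> items = recommender.recommend(123L, 5);
        for (RecommendedItem item : items) {
            System.out.println(item);
        }
    }
}
```

If your data model fits in memory on one box, something like this is the baseline the Hadoop job has to beat.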
Obviously, if your computation cannot fit on one of your machines, you have no choice but to distribute. Then the trade-off is what wall-clock time you can afford versus how much computing power you can spend. The reference to Amdahl's law is right, although it does not account for the significant overhead of Hadoop. For example, to parallelize N ways you need at least N mappers/reducers, and you incur the per-mapper/reducer overhead N times. There is also some fixed job startup/shutdown time.
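As a rough back-of-the-envelope model (my own sketch, not something stated in the original answer), you can bolt those overheads onto Amdahl's law. Here T_1 is the single-machine time, p the parallelizable fraction, c the per-mapper/reducer overhead, and T_start the fixed job startup/shutdown cost:

```latex
% Ideal Amdahl speedup on N workers:
S(N) = \frac{1}{(1 - p) + p/N}

% Rough wall-clock time of the Hadoop job (one wave of tasks assumed):
T_{\text{wall}}(N) \approx T_{\text{start}} + c + (1 - p)\,T_1 + \frac{p\,T_1}{N}

% Total compute consumed: the per-task overhead is paid N times.
T_{\text{total}}(N) \approx T_{\text{start}} + N\,c + T_1
```

The N·c term in total compute is why the distributed job can burn several times more resources even when the wall-clock time goes down.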