In practice, how many machines do you need for Hadoop / MapReduce / Mahout to speed up very parallelizable computing?

I need to do some machine learning calculations. I have a small number of spare machines at my disposal. How many machines would I need for distributing my calculations with Hadoop / MapReduce / Mahout to be significantly faster than running on a single machine without these distributed frameworks? This is a practical question about computational overhead versus benefit: I assume that splitting the work between only two machines will be slower than not distributing at all and just working on one machine, simply because of all the overhead that comes with distributing the computation.

Technical note: some of the heavy computation is very parallelizable. All of it is, as long as each machine has its own copy of the raw data.

+6
4 answers

A “simple” Java program and a Hadoop-based MapReduce implementation are very different animals and are hard to compare. It's not as if Hadoop parallelizes a bit of your program; the program is written in a completely different form from top to bottom.

Hadoop has overhead: there is the overhead of starting the job itself and spinning up workers such as mappers and reducers. Much more time goes into serializing/deserializing data, writing it locally, and transferring it to HDFS.

A Hadoop-based implementation will always consume more resources, so it's something to avoid unless you can't avoid it. If you can run the non-distributed computation on one machine, the simplest practical advice is: don't distribute. Save yourself the trouble.

In the case of Mahout's recommenders, I can tell you that, very roughly, the Hadoop job does 2-4x more computation than a non-distributed implementation on the same data. Obviously that depends a lot on the algorithm and on tuning choices. But to give you a number: I wouldn't bother with a Hadoop cluster of fewer than 4 machines.
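
To make that concrete, here is a back-of-envelope sketch (the 3x factor is just an assumed value inside the 2-4x range above, and the perfect-load-balance model is a deliberate simplification):

```python
# Back-of-envelope break-even estimate: if the distributed job does
# `overhead_factor` times more total computation than the single-machine
# version, an N-machine cluster only wins once N exceeds that factor.
def cluster_speedup(n_machines: int, overhead_factor: float = 3.0) -> float:
    """Idealized speedup over one machine, assuming perfect load balance
    and `overhead_factor` times extra work done by the distributed job."""
    return n_machines / overhead_factor

for n in (1, 2, 4, 8):
    print(f"{n} machines -> ~{cluster_speedup(n):.1f}x vs. a single machine")
# With a 3x factor you need more than 3 machines just to break even,
# which is roughly why a cluster of fewer than 4 machines isn't worth it.
```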

Obviously, if your computation can't fit on one of your machines, you have no choice but to distribute. Then the trade-off is what kind of wall-clock time you can afford versus how much computing power you can throw at the problem. The reference to Amdahl's law is right, although it doesn't account for Hadoop's significant overhead. For example, to parallelize N ways you need at least N mappers/reducers, and you incur the per-mapper/per-reducer overhead N times. There is also a fixed startup/shutdown time.
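
A minimal wall-clock model along those lines might look like the sketch below (every constant is made up purely for illustration, not measured):

```python
# Toy wall-clock model for an N-way Hadoop job: a fixed startup/shutdown
# cost, a per-worker overhead (paid by every mapper/reducer, but in
# parallel), and the actual work inflated by a constant factor and split
# N ways.  All constants here are assumptions for illustration only.
def wall_clock_hours(n_workers: int,
                     work_hours: float = 10.0,     # single-machine compute time
                     inflate: float = 3.0,         # extra work done by the Hadoop version
                     startup_hours: float = 0.1,   # fixed job startup/shutdown
                     per_worker_hours: float = 0.05) -> float:
    return startup_hours + per_worker_hours + (work_hours * inflate) / n_workers

for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} workers: ~{wall_clock_hours(n):.2f} h  (single machine: 10.00 h)")
```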

+8

See Amdahl's law.

Amdahl's law models the expected speedup of a parallel implementation of an algorithm relative to the sequential algorithm, under the assumption that the problem size stays the same when parallelized. For example, if for a given problem size a parallel implementation can execute 12% of the algorithm's operations arbitrarily fast (while the remaining 88% of the operations are not parallelizable), Amdahl's law states that the maximum speedup of the parallel version is 1 / (1 − 0.12) ≈ 1.136 times faster than the non-parallel implementation.

S(s) = 1 / ((1 − p) + p / s), where p is the parallelizable fraction of the work and s is the speedup of that fraction.
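
As a quick check of the 12% example, here is a small sketch of that formula (the 4x case at the end is only added for illustration):

```python
# Amdahl's law: speedup when a fraction p of the work is sped up by a
# factor s and the remaining (1 - p) stays sequential.
def amdahl_speedup(p: float, s: float = float("inf")) -> float:
    return 1.0 / ((1.0 - p) + p / s)

print(round(amdahl_speedup(0.12), 3))        # 12% made arbitrarily fast -> ~1.136
print(round(amdahl_speedup(0.12, s=4), 3))   # 12% sped up only 4x       -> ~1.099
```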

Without more details, it is difficult to give a more specific answer.

+6

I know this has already been answered, but I'll throw my hat in the ring anyway. I can't give you a general rule of thumb. The performance improvement really depends on many factors:

  • How parallel / mutually independent the components / algorithms are
  • Dataset size
  • Pre-processing and post-processing of the data set [including splitting / mapping and reducing / recombining]
  • Network traffic

If you have a highly connected algorithm, such as a Bayes net, neural networks, Markov models, PCA, or EM, then the Hadoop job will spend much of its time getting instances processed, split, and recombined. [This assumes you have a large number of nodes per instance (more than one machine can handle).] If you are in that situation, network traffic will become a problem.

If you have an algorithm such as path searching or simulated annealing, it is easy to split the instances into their own map/reduce jobs. These types of algorithms can be very fast.
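
As a sketch of why such workloads split so cleanly, here is a toy example (hypothetical code, not Mahout or Hadoop: independent simulated-annealing restarts are "mapped" across local processes and the best result is "reduced" at the end):

```python
import math
import random
from multiprocessing import Pool

def anneal(seed: int) -> float:
    """One independent simulated-annealing run on a toy 1-D objective (x^2).
    Each run needs nothing from the others, which is what makes the whole
    workload easy to express as a map followed by a tiny reduce."""
    rng = random.Random(seed)
    x, temp = rng.uniform(-10.0, 10.0), 5.0
    best = x * x
    for _ in range(10_000):
        cand = x + rng.gauss(0.0, 1.0)
        delta = cand * cand - x * x
        # Accept improvements always, worse moves with a temperature-dependent probability.
        if delta < 0 or rng.random() < math.exp(-delta / max(temp, 1e-9)):
            x = cand
        best = min(best, x * x)
        temp *= 0.999
    return best

if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(anneal, range(8))   # "map": independent runs
    print(min(results))                        # "reduce": keep the best result
```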

+1

Another question is: what is the bottleneck that is forcing you to use MapReduce? If your data fits comfortably on your machine and you are simply looking for a speedup, you may prefer to use GPU capabilities instead. They are easier to set up and use, even on a single machine, with promising results.

0
