MapReduce algorithms on terabytes of data?

This question does not have a single "correct" answer.

I am interested in running MapReduce algorithms on a cluster over terabytes of data.

I want to learn more about the running time of these algorithms.

What books should I read?

I am not interested in setting up MapReduce clusters or in running standard algorithms. I want a rigorous theoretical treatment of running time.

EDIT: The issue is not that MapReduce changes running time. The issue is that most algorithms do not distribute well to the map/reduce structure. I am interested in algorithms that do run on the MapReduce framework.
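For concreteness, word count is the canonical example of an algorithm that does decompose this way. A minimal, framework-free sketch in Python (purely illustrative; no Hadoop APIs are assumed):

```python
from collections import defaultdict

def map_phase(document):
    # Map step: emit a (word, 1) pair for every word, independently
    # per input record, so it parallelizes trivially across machines.
    for word in document.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce step: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["to be or not to be", "to map or to reduce"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
print(reduce_phase(pairs))
# {'to': 4, 'be': 2, 'or': 2, 'not': 1, 'map': 1, 'reduce': 1}
```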

+4
3 answers

Technically, there is no real difference between analyzing the runtime of MapReduce and analyzing standard algorithms. MapReduce is still an algorithm like any other (or, more precisely, a class of algorithms that run in several stages, with a certain amount of interaction between those stages).

The execution time of a MapReduce job still scales as ordinary algorithmic analysis predicts, once you account for the division of work across multiple machines and then take the maximum per-machine time required for each stage.

That is, if you have a task that requires M map operations and R reduce operations, running on N machines, and you expect an average map operation to take time m and an average reduce operation to take time r, then the expected running time for the whole job is ceil(M/N)*m + ceil(R/N)*r.

Predicting the values of M, R, m, and r is something that can be done with ordinary analysis of whatever algorithm you are plugging into MapReduce.
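A minimal sketch of that cost model in Python (the task counts and per-operation times below are made-up numbers, just to show the arithmetic):

```python
import math

def expected_runtime(M, R, N, m, r):
    # N machines process M map operations (average time m each) and
    # R reduce operations (average time r each). Each "wave" of N
    # parallel operations takes the time of one operation, so the job
    # needs ceil(M/N) map waves followed by ceil(R/N) reduce waves.
    return math.ceil(M / N) * m + math.ceil(R / N) * r

# Hypothetical numbers: 10,000 map tasks at 2 s each and 500 reduce
# tasks at 30 s each, spread across 100 machines.
print(expected_runtime(M=10_000, R=500, N=100, m=2.0, r=30.0))
# ceil(10000/100)*2 + ceil(500/100)*30 = 200 + 150 = 350.0 seconds
```

As the answer notes, in practice the slowest machine in each wave sets the pace, so the averages m and r are only a first approximation.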

+4

There are only two books that I know of, but more are in the works:

Pro Hadoop and Hadoop: The Definitive Guide

Of these, Pro Hadoop is more of a beginner's book, while The Definitive Guide is for those who already know what Hadoop actually is.

I have The Definitive Guide and think it is a great book. It contains good technical detail on how HDFS works, and also covers a range of related topics such as MapReduce, Pig, Hive, HBase, etc. It should also be noted that this book was written by Tom White, who has been involved in Hadoop development for some time and now works at Cloudera.

Regarding the analysis of algorithms on Hadoop, you could take a look at the terabyte sort benchmarks. Yahoo has documented how Hadoop performs on this particular benchmark: TeraByte Sort on Apache Hadoop. That paper was written in 2008.

More information on the results for 2009 can be found here.

+1

There is an excellent book, Mining of Massive Datasets, on data mining algorithms applied to the MapReduce model.

It was written by two Stanford professors, and it is available for free:

http://infolab.stanford.edu/~ullman/mmds.html

0
