MapReduce - Anything Beyond Word Counting?

I have been looking into MapReduce and have read various articles about it and its applications, but it seems to me that MapReduce is only suitable for a very narrow class of tasks that ultimately boil down to word counting.

If you look at the original paper, the Google authors present "various" potential use cases such as "distributed grep", "distributed sort", "reverse web-link graph", "term-vector per host", etc. But if you look closer, all of these problems come down to a simple "word count": counting the number of occurrences of something in a chunk of data, then aggregating/filtering and sorting that list of occurrences.
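To make that point concrete, here is a minimal sketch in plain Python (no Hadoop; the run_mapreduce helper and all names are invented for illustration) showing that word count and the reverse web-link graph are the same map/shuffle/reduce shape with different emit and aggregate functions:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Map phase: every record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle phase stand-in: group all values by key.
    pairs.sort(key=itemgetter(0))
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

# Word count: map emits (word, 1), reduce sums the ones.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

# Reverse web-link graph: map emits (target, source), reduce collects sources.
def link_map(page):
    source, targets = page
    return [(target, source) for target in targets]

def link_reduce(target, sources):
    return sorted(sources)

print(run_mapreduce(["a b a", "b c"], wc_map, wc_reduce))
# {'a': 2, 'b': 2, 'c': 1}
print(run_mapreduce([("p1", ["p2", "p3"]), ("p2", ["p3"])], link_map, link_reduce))
# {'p2': ['p1'], 'p3': ['p1', 'p2']}
```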

There are also cases where MapReduce is used for genetic algorithms or relational databases, but these do not use the "vanilla" MapReduce published by Google; instead, they add further stages to the Map-Reduce chain, such as Map-Reduce-Merge.

Do you know of any other (documented?) scenarios where "vanilla" MapReduce is used for something other than a simple word count? (Perhaps ray tracing, video transcoding, cryptography, etc.; in short, any "heavy computation" that is parallelizable.)

+4
4 answers

MapReduce works well for problems that can be considered embarrassingly parallel. There are many problems MapReduce handles very poorly, such as those that require a lot of communication between nodes: fast Fourier transforms and signal correlation, for example.
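To illustrate the distinction, here is a minimal sketch (plain Python, invented names, not code from any real framework) of an embarrassingly parallel job, a Monte Carlo estimate of pi: each map task works on its split with zero communication between tasks, and the reduce step only adds counters. An FFT, by contrast, needs data exchanged between nodes at every stage.

```python
import random

def map_task(seed, samples):
    # Independent per-split work: count random points inside the unit circle.
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return hits, samples

def reduce_task(partials):
    # The only "communication": summing the per-task counters.
    hits = sum(h for h, _ in partials)
    total = sum(n for _, n in partials)
    return 4.0 * hits / total

partials = [map_task(seed, 100_000) for seed in range(8)]
print(reduce_task(partials))  # roughly 3.14
```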

+2

Atbrox has been maintaining a list of MapReduce/Hadoop algorithms from academic papers. Here's the link. All of them can be applied for practical purposes.

+4

There are projects using MapReduce for parallel computing in statistics. For example, Revolution Analytics launched the RHadoop project for use with R. Hadoop is also used in computational biology and other fields with large data sets that can be analyzed as many discrete tasks.
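As a sketch of that "many discrete tasks" pattern (a minimal plain-Python illustration, not actual RHadoop code): each map task emits partial sufficient statistics for its split, and the reduce combines them into global statistics without ever collecting the raw data on one node.

```python
def map_stats(split):
    # Per-split sufficient statistics: count, sum, sum of squares.
    n = len(split)
    s = sum(split)
    ss = sum(x * x for x in split)
    return n, s, ss

def reduce_stats(partials):
    # Combine the partials into a global mean and (population) variance.
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    return mean, ss / n - mean ** 2

splits = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(reduce_stats([map_stats(split) for split in splits]))
# (3.5, 2.916...)
```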

+1

I am the author of one of the RHadoop packages, and I have written several examples that are distributed with the source code and used in tutorials: logistic regression, linear least squares, matrix multiplication, etc. There is also a paper I would like to recommend, http://www.mendeley.com/research/sorting-searching-simulation-mapreduce-framework/, which seems to strongly support the equivalence of MapReduce with classic parallel programming models such as PRAM and BSP. I often write MapReduce algorithms as ports of PRAM algorithms; see, for example, blog.piccolboni.info/2011/04/map-reduce-algorithm-for-connected.html.

So I believe the scope of MapReduce is clearly larger than "embarrassingly parallel" problems, but not unlimited. I have run into some limitations myself, for example in speeding up some MCMC simulations, though it is of course possible that I did not use the right approach.

My rule of thumb is: if the problem can be solved in parallel in O(log(N)) time on O(N) processors, then it is a good candidate for MapReduce, with O(log(N)) jobs and constant time spent per job. Other people, and the paper I mentioned, seem to focus more on the O(1)-jobs case. When you go beyond O(log(N)) jobs, the case for MR seems a little weaker, but some of the limitations may be inherent to the current implementation (high job overhead) rather than fundamental. It's a pretty exciting time to be working on charting the MR territory.
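As a toy illustration of that rule of thumb (plain Python, invented for this answer, not one of the distributed examples): a global sum run as log(N) rounds, where each round is a constant-time MapReduce job that halves the number of keys. This is the shape many PRAM-to-MapReduce ports take.

```python
def mapreduce_round(values):
    # One constant-time round: map re-keys item i to i // 2, reduce sums
    # the (at most two) items sharing a key, halving the key space.
    merged = {}
    for i, v in enumerate(values):
        merged.setdefault(i // 2, []).append(v)
    return [sum(group) for _, group in sorted(merged.items())]

values = list(range(16))  # N = 16 input records
rounds = 0
while len(values) > 1:
    values = mapreduce_round(values)
    rounds += 1
print(values[0], rounds)  # 120 4: the sum of 0..15 in log2(16) = 4 jobs
```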

0
