I am the author of one of the packages in RHadoop, and I have written several examples distributed with the source code and used in textbooks: logistic regression, linear least squares, matrix multiplication, and so on. There is also a paper I would like to recommend, http://www.mendeley.com/research/sorting-searching-simulation-mapreduce-framework/, which seems to strongly support the equivalence of mapreduce with classic parallel programming models such as PRAM and BSP. I often write mapreduce algorithms as ports of PRAM algorithms; see, for example, blog.piccolboni.info/2011/04/map-reduce-algorithm-for-connected.html. So I believe the scope of mapreduce is clearly broader than "embarrassingly parallel" problems, but not unlimited. I have run into some limitations myself, for instance in speeding up some MCMC simulations, though it is possible I simply didn't use the right approach.

My rule of thumb is: if a problem can be solved in parallel in O(log N) time on O(N) processors, then it is a good candidate for mapreduce, with O(log N) jobs and constant time spent in each job. Other people, and the paper I mentioned, seem to focus more on the O(1)-jobs case. Beyond O(log N) jobs the case for MR seems a little weaker, but some of the limitations may be inherent to the current implementations (high overhead) rather than fundamental. It is an exciting time to be working on charting the MR territory.
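To make the rule of thumb concrete, here is a toy, in-memory sketch (plain Python, not RHadoop, and all function names are my own invention) of porting a classic PRAM tree reduction to mapreduce: each round the mapper keys adjacent values together and the reducer sums each pair, so N values are combined in O(log N) rounds of constant-time tasks.

```python
from itertools import groupby
from operator import itemgetter

def mapreduce(data, mapper, reducer):
    """A minimal in-memory mapreduce: apply mapper to each record,
    shuffle (sort + group) by key, then reduce each key group."""
    mapped = [kv for record in data for kv in mapper(record)]
    mapped.sort(key=itemgetter(0))  # the "shuffle" phase
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

def pair_mapper(record):
    index, value = record
    yield (index // 2, value)       # pair up adjacent indices

def sum_reducer(key, values):
    return (key, sum(values))       # combine each pair in O(1)

def tree_sum(values):
    """PRAM-style reduction: O(log N) mapreduce rounds over N values."""
    data = list(enumerate(values))
    while len(data) > 1:
        data = mapreduce(data, pair_mapper, sum_reducer)
    return data[0][1]

print(tree_sum(range(1, 9)))  # prints 36
```

On a real cluster each round would be one MR job, which is exactly where the current-implementation overhead mentioned above bites: O(log N) job launches, each doing constant work per task.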