How well (or badly) does MongoDB handle Map/Reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
MongoDB's Map/Reduce implementation (as of version 2.0.x) is limited by its reliance on the single-threaded SpiderMonkey JavaScript engine. There has been some experimentation with the V8 JavaScript engine, and improved concurrency and performance is an overall design goal.
The new Aggregation Framework is written in C++ and has a more scalable implementation, including a "pipeline" approach. Each pipeline is currently single-threaded, but you can run different pipelines in parallel. The Aggregation Framework does not currently replace every job that can be done with Map/Reduce, but it does simplify many common use cases.
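To make the pipeline idea concrete, here is a minimal pure-Python sketch of documents flowing through a sequence of stages. It does not call MongoDB at all: the `match` and `group_sum` helpers, `run_pipeline`, and the sample `orders` documents are hypothetical stand-ins that only mimic the spirit of `$match` and `$group` stages.

```python
from collections import defaultdict
from functools import reduce


def match(predicate):
    """Stage factory: keep only documents for which predicate(doc) is true."""
    def stage(docs):
        return (d for d in docs if predicate(d))
    return stage


def group_sum(key, field):
    """Stage factory: group by `key` and sum `field` (loosely like $group with $sum)."""
    def stage(docs):
        totals = defaultdict(int)
        for d in docs:
            totals[d[key]] += d[field]
        return ({"_id": k, "total": v} for k, v in sorted(totals.items()))
    return stage


def run_pipeline(docs, stages):
    """Feed the output of each stage into the next one, in order."""
    return list(reduce(lambda stream, stage: stage(stream), stages, docs))


# Hypothetical sample data, not taken from any real collection.
orders = [
    {"customer": "a", "status": "shipped", "amount": 10},
    {"customer": "b", "status": "shipped", "amount": 5},
    {"customer": "a", "status": "pending", "amount": 7},
    {"customer": "a", "status": "shipped", "amount": 3},
]

result = run_pipeline(orders, [
    match(lambda d: d["status"] == "shipped"),
    group_sum("customer", "amount"),
])
print(result)
```

Each stage only sees the stream produced by the previous one, which is why filtering early keeps the later grouping work small.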
A third option is to use MongoDB for storage in conjunction with Hadoop via the MongoDB Hadoop Connector. Hadoop currently has a more scalable Map/Reduce implementation and can access MongoDB collections for input and output through the connector.
From a workflow point of view, is it relatively easy to store and merge the incremental results generated by Map/Reduce?
Map/Reduce has several output options, including merging the incremental output into a previous output collection or returning the results inline (in memory).
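As a rough illustration of incremental output, the sketch below simulates the idea in plain Python, with a dict standing in for the output collection. It is not MongoDB's API: `map_reduce`, `apply_output`, the `mode` names, and the page-hit data are all assumptions made for the example. The "merge" mode overwrites values for keys that already exist, while the "reduce" mode re-applies the reduce function to the stored and new values.

```python
from collections import defaultdict


def map_reduce(docs, map_fn, reduce_fn):
    """Run a tiny in-memory map/reduce and return {key: reduced_value}."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            emitted[key].append(value)
    return {key: reduce_fn(key, values) for key, values in emitted.items()}


def apply_output(previous, new, mode, reduce_fn=None):
    """Combine a new result with a previous output 'collection' (a dict here)."""
    combined = dict(previous)
    for key, value in new.items():
        if mode == "reduce" and key in combined:
            # Re-reduce the stored value together with the new one.
            combined[key] = reduce_fn(key, [combined[key], value])
        else:
            # "merge" mode (or a brand-new key): the new value is stored as-is.
            combined[key] = value
    return combined


def map_page(doc):
    yield doc["page"], doc["hits"]


def sum_hits(key, values):
    return sum(values)


# Hypothetical batches of input documents.
day1 = [{"page": "/home", "hits": 3}, {"page": "/about", "hits": 1}]
day2 = [{"page": "/home", "hits": 2}, {"page": "/docs", "hits": 4}]

totals = map_reduce(day1, map_page, sum_hits)
batch = map_reduce(day2, map_page, sum_hits)

merged = apply_output(totals, batch, mode="merge")
reduced = apply_output(totals, batch, mode="reduce", reduce_fn=sum_hits)
print(merged)
print(reduced)
```

Re-reducing only gives correct totals when the reduce function can safely be applied to its own earlier output, as a sum can.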
How much of a performance improvement does the Aggregation Framework offer?
It really depends on the complexity of your Map/Reduce job. In general, the Aggregation Framework is faster (in some cases significantly so). Your best bet is to run a comparison for your own use case.
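One way to run such a comparison is a small timing harness like the sketch below. The two workload functions are hypothetical placeholders that just sum numbers in memory; in practice you would swap in functions that issue your actual Map/Reduce job and the equivalent aggregation against your own data.

```python
import time


def time_call(fn, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare(candidates, repeats=5):
    """Time each named callable and return {name: best_seconds}."""
    return {name: time_call(fn, repeats) for name, fn in candidates.items()}


# Placeholder workloads (assumptions for the example, not real queries).
data = list(range(100_000))


def run_map_reduce():
    return sum(x for x in data if x % 2 == 0)


def run_aggregation():
    return sum(data[::2])


timings = compare({"map_reduce": run_map_reduce, "aggregation": run_aggregation})
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.6f}s")
```

Taking the best of several runs reduces noise from caching and other activity on the machine, but you should still check that both approaches return the same results before comparing their timings.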
Note that the Aggregation Framework is part of MongoDB 2.2, which has not been officially released yet, but 2.2rc0 has been available since mid-July.
Does the Aggregation Framework provide the ability to store results in stages in the same way as the existing Map/Reduce functionality?
Currently, the Aggregation Framework is limited to returning results inline, so you need to process or display the results as they are returned. The result document is also limited to the maximum document size in MongoDB (currently 16 MB).
There is a proposed $out pipeline command (SERVER-3253), which is likely to be added in the future to provide additional output options.
Some additional reading that may be of interest: