MapReduce with MongoDB is really, really slow (30 hours versus 20 minutes in MySQL for the equivalent query)

I am now doing some analysis of the data, and for the very first one, which is really simple, I am getting very strange results.

The idea is as follows: from an Internet access log (a collection with one document per access; 90 million documents for testing), I want to get the number of accesses per domain (what would be a GROUP BY in MySQL) and find the 10 most visited domains.

The script I made in JavaScript is very simple:

 /* Count the visits to each domain */
 m = function () {
     emit(this.domain, 1);
 }

 r = function (key, values) {
     var total = 0;
     for (var i in values) {
         total += values[i];
     }
     return total;
 }

 /* Store the visits-per-domain statistics in the NonFTP_Access_log_domain_visits collection */
 res = db.NonFTP_Access_log.mapReduce(m, r, { out: { replace: "NonFTP_Access_log_domain_visits" } });

 db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1 });
 db.NonFTP_Access_log_domain_visits.find({}).sort({ "value": -1 }).limit(10).forEach(printjson);

The equivalent in MySQL is:

 drop table if exists NonFTP_Access_log_domain_visits;

 create table NonFTP_Access_log_domain_visits (
     `domain` varchar(255) NOT NULL,
     `value` int unsigned not null,
     PRIMARY KEY (`domain`),
     KEY `value_index` (`value`)
 ) ENGINE=MyISAM DEFAULT CHARSET=utf8
 select domain, count(*) as value from NonFTP_Access_log group by domain;

 select * from NonFTP_Access_log_domain_visits order by value desc limit 10;

Well, MongoDB takes 30 hours to get the result and MySQL only 20 minutes! After reading a little, I have come to the conclusion that for data analysis we will have to use Hadoop, since MongoDB is really slow. Answers to questions like this one say that:

  • MongoDB uses only one thread
  • JavaScript is too slow

What am I doing wrong? Are these results normal? Should I use Hadoop?

We ran this test in the following environment:

  • Operating system: Suse Linux Enterprise Server 10 (virtual server on Xen)
  • RAM: 10 GB
  • Cores: 32 (AMD Opteron 6128)
+8
mongodb mapreduce hadoop
Aug 27 '12 at 9:13
3 answers

I actually answered a very similar question before. The limitations of Map Reduce in MongoDB have been outlined there - as you mentioned, it is single-threaded, the data has to be converted to JavaScript (SpiderMonkey) and back, etc.

This is why there are other options, most notably the new Aggregation Framework:

At the time of writing, the stable 2.2.0 release is not yet out, but it is already up to RC2, so the release should be imminent. I would recommend giving it a shot as a more meaningful comparison for this type of testing.
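As a rough sketch (untested, and simply reusing the collection and field names from the question), the equivalent pipeline would group by domain, sort by the count, and keep the top 10:

 // Sketch only: group accesses by domain, count them, and keep the 10 largest counts
 db.NonFTP_Access_log.aggregate(
     { $group: { _id: "$domain", visits: { $sum: 1 } } },
     { $sort: { visits: -1 } },
     { $limit: 10 }
 );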

+12
Aug 27 '12

Apparently, grouping with the Aggregation Framework works very well! :-)

The following JavaScript code gets the 10 most visited domains in just 17m17s!

 db.NonFTP_Access_log.aggregate(
     { $group: { _id: "$domain", visits: { $sum: 1 } } },
     { $sort: { visits: -1 } },
     { $limit: 10 }
 ).result.forEach(printjson);

In any case, I still don't understand why the MapReduce alternative is so slow. I have opened a question about it in the MongoDB JIRA.

+5
Aug 28 '12 at 8:50

I think your results are quite normal and I will try to justify them.
1. MySQL uses a binary format that is optimized for processing, while MongoDB works with JSON. So parsing time is added to the processing. I would estimate it at a factor of 10 at least.
2. JS is really much slower than C. I think you can assume a factor of at least 10. Together that gives about x100, which looks like what you are seeing: 20 minutes x 100 is 2000 minutes, or about 33 hours (see the sketch after this list).
3. Hadoop is not that efficient at processing data either, but it is able to use all the cores you have, and that matters. Java also has a JIT that has been developed and optimized for more than 10 years.
4. I would suggest looking not at MySQL, but at the TPC-H Q1 test, which is pure aggregation. I think systems like VectorWise will show the highest possible throughput per core.
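As a back-of-the-envelope check of points 1 and 2, here is a tiny sketch of that estimate (the variable names are just for illustration, and the two 10x factors are my assumptions, not measurements):

 // Rough estimate only: the two 10x factors are assumptions, not measurements.
 var mysqlMinutes = 20;   // measured MySQL time from the question
 var parseFactor  = 10;   // point 1: JSON parsing vs. MySQL's binary format
 var jsFactor     = 10;   // point 2: JavaScript vs. C
 var estimatedHours = mysqlMinutes * parseFactor * jsFactor / 60;
 print(estimatedHours + " hours");   // ~33 hours, close to the observed 30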

0
Aug 27 '12 at 19:55


