MongoDB: Terrible MapReduce Performance

I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm pretty sure I must be doing something wrong. I'll go straight to the question. Sorry if it's long.

I have a database in MySQL that tracks the number of member profile views for each day. It has 10,000,000 rows for testing.

    CREATE TABLE `profile_views` (
      `id` int(10) unsigned NOT NULL auto_increment,
      `username` varchar(20) NOT NULL,
      `day` date NOT NULL,
      `views` int(10) unsigned default '0',
      PRIMARY KEY (`id`),
      UNIQUE KEY `username` (`username`,`day`),
      KEY `day` (`day`)
    ) ENGINE=InnoDB;

Typical data might look like this.

    +--------+----------+------------+------+
    | id     | username | day        | hits |
    +--------+----------+------------+------+
    | 650001 | Joe      | 2010-07-10 |    1 |
    | 650002 | Jane     | 2010-07-10 |    2 |
    | 650003 | Jack     | 2010-07-10 |    3 |
    | 650004 | Jerry    | 2010-07-10 |    4 |
    +--------+----------+------------+------+

I use this query to get the top 5 most viewed profiles since 2010-07-16.

    SELECT username, SUM(hits)
    FROM profile_views
    WHERE day > '2010-07-16'
    GROUP BY username
    ORDER BY hits DESC
    LIMIT 5\G

This query completes in under a minute. Not bad!

Now let's move on to the world of MongoDB. I set up a sharded environment using 3 servers: M, S1, and S2. I used the following commands to set up the rig (Note: I've obscured the IP addresses).

    S1 => 127.20.90.1
    ./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

    S2 => 127.20.90.7
    ./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

    M => 127.20.4.1
    ./mongod --fork --configsvr --dbpath=/data/db --logpath=/data/log
    ./mongos --fork --configdb 127.20.4.1 --chunkSize 1 --logpath=/data/slog

Once those were up, I hopped over to server M and launched the mongo shell. I issued the following commands:

    use admin
    db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
    db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
    db.runCommand( { enablesharding : "profiles" } );
    db.runCommand( { shardcollection : "profiles.views", key : { day : 1 } } );
    use profiles
    db.views.ensureIndex({ hits: -1 });

Then I imported the same 10,000,000 rows from MySQL, which gave me documents that look like this:

    {
        "_id" : ObjectId("4cb8fc285582125055295600"),
        "username" : "Joe",
        "day" : "Fri May 21 2010 00:00:00 GMT-0400 (EDT)",
        "hits" : 16
    }

Now comes the real meat and potatoes... my map and reduce functions. Back on server M in the shell, I set up the query and execute it like this:

    use profiles;
    var start = new Date(2010, 7, 16);
    var map = function() {
        emit(this.username, this.hits);
    }
    var reduce = function(key, values) {
        var sum = 0;
        for(var i in values) sum += values[i];
        return sum;
    }
    res = db.views.mapReduce(
        map,
        reduce,
        { query : { day: { $gt: start } } }
    );
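To make the emit/reduce semantics concrete, here is a sketch of what that call does conceptually, runnable in plain Node.js with no MongoDB server; the sample documents are made up for illustration.

```javascript
// Standalone sketch of mapReduce semantics: map emits one (key, value)
// pair per document, then reduce collapses each key's list of values.
// The three sample documents below are made up.
const docs = [
  { username: "Joe",  day: "2010-07-17", hits: 4 },
  { username: "Joe",  day: "2010-07-18", hits: 7 },
  { username: "Jane", day: "2010-07-18", hits: 2 },
];

// Collect emitted values per key
const emitted = {};
function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

// Same map/reduce bodies as above; map runs with each document as `this`
const map = function () { emit(this.username, this.hits); };
const reduce = function (key, values) {
  let sum = 0;
  for (const i in values) sum += values[i];
  return sum;
};

docs.forEach(d => map.call(d));
const result = {};
for (const key in emitted) result[key] = reduce(key, emitted[key]);
console.log(result); // { Joe: 11, Jane: 2 }
```

Each output document ends up keyed by username with the summed hits as its value, which is what the real job produces per shard.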

And here is where I ran into problems. This query took over 15 minutes to complete! The MySQL query took under a minute. Here's the output:

    {
        "result" : "tmp.mr.mapreduce_1287207199_6",
        "shardCounts" : {
            "127.20.90.7:10000" : {
                "input" : 4917653,
                "emit" : 4917653,
                "output" : 1105648
            },
            "127.20.90.1:10000" : {
                "input" : 5082347,
                "emit" : 5082347,
                "output" : 1150547
            }
        },
        "counts" : {
            "emit" : NumberLong(10000000),
            "input" : NumberLong(10000000),
            "output" : NumberLong(2256195)
        },
        "ok" : 1,
        "timeMillis" : 811207,
        "timing" : {
            "shards" : 651467,
            "final" : 159740
        }
    }

Not only did it take forever to run, but the results don't even seem to be correct.

    db[res.result].find().sort({ hits: -1 }).limit(5);
    { "_id" : "Joe", "value" : 128 }
    { "_id" : "Jane", "value" : 2 }
    { "_id" : "Jerry", "value" : 2 }
    { "_id" : "Jack", "value" : 2 }
    { "_id" : "Jessy", "value" : 3 }

I know that these value numbers should be much higher.

My understanding of the whole MapReduce paradigm is that the task of executing this query should be split among all the shard members, which should improve performance. I waited until Mongo had distributed the documents between the two shard servers after the import. Each had almost exactly 5,000,000 documents when I started this query.

So I must be doing something wrong. Can anyone give me any pointers?

Edit: Someone on IRC mentioned adding an index on the day field, but as far as I can tell, MongoDB did that automatically.

+42
mongodb nosql mapreduce
Oct. 16 '10 at 6:11
4 answers

Excerpts from MongoDB: The Definitive Guide from O'Reilly:

The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in "real time." You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

    Options for map/reduce:
    "keeptemp" : boolean
        Whether the temporary result collection should be saved when the
        connection is closed.
    "out" : string
        Name for the output collection. Setting this option implies
        keeptemp : true.
+53
17 Oct '10 at 3:18
β€” -

I may be late, but ...

First, you are querying the collection to feed MapReduce without an index. You should create an index on day.

MongoDB MapReduce is single-threaded on a single server, but it is parallelized across shards. The data in Mongo shards is kept together in contiguous chunks sorted by the shard key.

Since your shard key is day, and you are querying on it, you are probably only using one of your three servers. The shard key is only used to distribute the data. MapReduce will query using the day index on each shard, and will be very fast.

Add something in front of the day key to spread the data. The username can be a good choice.

That way the map reduce will be launched on all servers and will hopefully cut the time by a factor of three.
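A toy model of that routing, in plain JavaScript (the chunk ranges and shard names below are made up): a range query on a shard key only has to touch the shards whose chunks can contain matching keys, so with day alone as the key, one shard does all the work.

```javascript
// Toy model of chunk routing: chunks are contiguous shard-key ranges,
// each owned by one shard. A range query only touches shards whose
// chunks can contain matching keys. Ranges and shard names are made up.
const chunks = [
  { min: "2010-01-01", max: "2010-06-01", shard: "M1" },
  { min: "2010-06-01", max: "2010-12-31", shard: "M2" },
];

// Shards that must do work for a query like { day: { $gt: minDay } }
function shardsForQuery(minDay) {
  return [...new Set(chunks.filter(c => c.max > minDay).map(c => c.shard))];
}

console.log(shardsForQuery("2010-07-16")); // [ 'M2' ] -- one shard works
console.log(shardsForQuery("2010-01-15")); // [ 'M1', 'M2' ]
```

With a compound key like { username, day }, each username's chunks spread across shards, so the same date-range query fans out to all of them.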

Something like this:

    use admin
    db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
    db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
    db.runCommand( { enablesharding : "profiles" } );
    db.runCommand( { shardcollection : "profiles.views", key : { username : 1, day : 1 } } );
    use profiles
    db.views.ensureIndex({ hits: -1 });
    db.views.ensureIndex({ day: -1 });

I think with those additions, you can match the speed of MySQL, or even beat it.

Also, it is better not to run map reduce in real time. If your data does not need to be precise to the minute, schedule a map reduce job every now and then and use the result collection.

+27
Dec 01 2018-10-01

You are not doing anything wrong. (Besides sorting by the wrong value, as you already noticed in your comments.)
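For the record, a quick sketch of that sort issue in plain JavaScript (the sample values are made up): the M/R output documents only contain _id and value, so sorting on hits sorts on a field that doesn't exist.

```javascript
// M/R output documents only have _id and value; there is no "hits"
// field, so sort({ hits: -1 }) cannot order them meaningfully. Sorting
// on the field that actually exists gives the intended top list.
// The sample values below are made up.
const out = [
  { _id: "Jane",  value: 57 },
  { _id: "Joe",   value: 128 },
  { _id: "Jessy", value: 45 },
];

const top = out.slice().sort((a, b) => b.value - a.value);
console.log(top.map(d => d._id)); // [ 'Joe', 'Jane', 'Jessy' ]
```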

MongoDB's map/reduce performance just isn't that great. This is a known issue; see, for example, http://jira.mongodb.org/browse/SERVER-1197 , where a naive approach is ~350x faster than M/R.

One advantage is that you can specify a permanent output collection name with the out argument of the mapReduce call. Once the M/R is completed, the temporary collection will be renamed to the permanent name atomically. That way you can schedule your statistics updates and query the M/R output collection in real time.
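A sketch of that schedule-and-swap pattern in plain JavaScript (a stand-in; in MongoDB the atomic rename happens for you when mapReduce is given a permanent name via out). Readers always see a complete result set, never a half-built one.

```javascript
// Stand-in for the schedule-and-swap pattern: recompute results off to
// the side, then publish them with a single swap (like MongoDB's atomic
// rename of the temp collection). Names and values are made up.
const db = { stats: [{ _id: "Joe", value: 128 }] }; // live results

function runScheduledJob(computeFreshStats) {
  const tmp = computeFreshStats(); // expensive M/R builds off to the side
  db.stats = tmp;                  // single swap, like the atomic rename
}

runScheduledJob(() => [{ _id: "Jane", value: 200 }]);
console.log(db.stats[0]._id); // 'Jane'
```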

+6
Oct. 16 '10 at 17:29

Have you tried using the Hadoop connector for MongoDB yet?

Have a look at this link here: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/

Since you only use 3 shards, I don't know if this approach will improve your case.

0
Feb 17 '14 at 6:19


