I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm pretty sure that I should be doing something wrong. I will go straight to the question. Sorry if it is long.
I have a database in MySQL that tracks the number of member profile views for each day. He has 10,000,000 lines for testing.
CREATE TABLE `profile_views` ( `id` int(10) unsigned NOT NULL auto_increment, `username` varchar(20) NOT NULL, `day` date NOT NULL, `views` int(10) unsigned default '0', PRIMARY KEY (`id`), UNIQUE KEY `username` (`username`,`day`), KEY `day` (`day`) ) ENGINE=InnoDB;
Typical data might look like this.
+--------+----------+------------+------+ | id | username | day | hits | +--------+----------+------------+------+ | 650001 | Joe | 2010-07-10 | 1 | | 650002 | Jane | 2010-07-10 | 2 | | 650003 | Jack | 2010-07-10 | 3 | | 650004 | Jerry | 2010-07-10 | 4 | +--------+----------+------------+------+
I use this query to get the top 5 most viewed profiles from 2010-07-16.
SELECT username, SUM(hits) FROM profile_views WHERE day > '2010-07-16' GROUP BY username ORDER BY hits DESC LIMIT 5\G
This request completes in less than a minute. Not bad!
Now let's move on to the world of MongoDB. I am setting up a plastered environment using 3 servers. Servers M, S1 and S2. I used the following commands to install the rig (Note: I closed the IP add-ons).
S1 => 127.20.90.1 ./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log S2 => 127.20.90.7 ./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log M => 127.20.4.1 ./mongod --fork --configsvr --dbpath=/data/db --logpath=/data/log ./mongos --fork --configdb 127.20.4.1 --chunkSize 1 --logpath=/data/slog
As soon as those were launched, I jumped to the M server and launched the mongo. I issued the following commands:
use admin db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } ); db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } ); db.runCommand( { enablesharding : "profiles" } ); db.runCommand( { shardcollection : "profiles.views", key : {day : 1} } ); use profiles db.views.ensureIndex({ hits: -1 });
Then I imported the same 10,000,000 rows from MySQL that provided me with documents that look like this:
{ "_id" : ObjectId("4cb8fc285582125055295600"), "username" : "Joe", "day" : "Fri May 21 2010 00:00:00 GMT-0400 (EDT)", "hits" : 16 }
Now comes the real meat and potatoes ... My map and functions are diminishing. Back to server M in the shell I configure the request and execute it as follows.
use profiles; var start = new Date(2010, 7, 16); var map = function() { emit(this.username, this.hits); } var reduce = function(key, values) { var sum = 0; for(var i in values) sum += values[i]; return sum; } res = db.views.mapReduce( map, reduce, { query : { day: { $gt: start }} } );
And here I ran into problems. This request took more than 15 minutes! MySQL query took a minute. Here's the conclusion:
{ "result" : "tmp.mr.mapreduce_1287207199_6", "shardCounts" : { "127.20.90.7:10000" : { "input" : 4917653, "emit" : 4917653, "output" : 1105648 }, "127.20.90.1:10000" : { "input" : 5082347, "emit" : 5082347, "output" : 1150547 } }, "counts" : { "emit" : NumberLong(10000000), "input" : NumberLong(10000000), "output" : NumberLong(2256195) }, "ok" : 1, "timeMillis" : 811207, "timing" : { "shards" : 651467, "final" : 159740 }, }
Not only will you run forever, but the results don't even seem right.
db[res.result].find().sort({ hits: -1 }).limit(5); { "_id" : "Joe", "value" : 128 } { "_id" : "Jane", "value" : 2 } { "_id" : "Jerry", "value" : 2 } { "_id" : "Jack", "value" : 2 } { "_id" : "Jessy", "value" : 3 }
I know that these value numbers should be much higher.
My understanding of the entire MapReduce paradigm is that the task of executing this request should be shared among all shard members, which should improve performance. I waited for Mongo to hand out documents between the two shard servers after importing. Each of them had almost exactly 5,000,000 documents when I started this request.
So I have to do something wrong. Can anyone give me any directions?
Edit: Someone from the IRC noted adding an index to the day field, but as far as I can tell, MongoDB did this automatically.