I am now doing some analysis of the data, and the first case, although really simple, is giving me very strange results.
The idea is as follows: from an Internet access log (a collection with one document per access; 90 million documents for testing) I want to get the number of accesses per domain (what would be a GROUP BY in MySQL) and the 10 most accessed domains.
The script I made in JavaScript is very simple:
m = function () {
    // Map: emit a count of 1 for each access, keyed by domain
    emit(this.domain, 1);
}

r = function (key, values) {
    // Reduce: sum the counts emitted for a domain
    var total = 0;
    for (var i in values) {
        total += Number(values[i]);
    }
    return total;
}

res = db.NonFTP_Access_log.mapReduce(m, r, { out: { replace: "NonFTP_Access_log_domain_visits" } });
db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1 });
db.NonFTP_Access_log_domain_visits.find({}).sort({ "value": -1 }).limit(10).forEach(printjson);
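For what it's worth, I understand the same group-and-count can also be written with MongoDB's aggregation framework (available from version 2.2), which runs in native code instead of the JavaScript engine. A minimal sketch, assuming a 2.2+ server; I have not benchmarked it against the numbers below:

// Group accesses by domain, count them, and return the 10 biggest domains
db.NonFTP_Access_log.aggregate([
    { $group: { _id: "$domain", value: { $sum: 1 } } },
    { $sort: { value: -1 } },
    { $limit: 10 }
])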
There is an equivalent in MySQL:
drop table if exists NonFTP_Access_log_domain_visits;

create table NonFTP_Access_log_domain_visits (
    `domain` varchar(255) NOT NULL,
    `value` int unsigned not null,
    PRIMARY KEY (`domain`),
    KEY `value_index` (`value`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
    select domain, count(*) as value
    from NonFTP_Access_log
    group by domain;

select * from NonFTP_Access_log_domain_visits order by value desc limit 10;
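Both versions should of course agree on the numbers. As a quick sanity check I compare the figures for a single domain on each side ("example.com" here is just a placeholder for a real domain from the log):

// Raw count for one (placeholder) domain vs. the mapReduce output,
// where the domain ends up as the _id of the result document
db.NonFTP_Access_log.count({ domain: "example.com" });
db.NonFTP_Access_log_domain_visits.findOne({ _id: "example.com" });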
Well, MongoDB takes 30 hours to get the result and MySQL takes 20 minutes! After a bit of reading I have come to the conclusion that for data analysis we will have to use Hadoop, as MongoDB is really slow. Answers to questions like this one say that:
- MongoDB uses only one thread
- JavaScript is too slow
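On the JavaScript point, I have also seen the jsMode option mentioned: it keeps the intermediate map output inside the JavaScript engine instead of converting it to BSON and back between phases. A sketch of how it would be applied to my job, assuming a server version that supports jsMode and fewer than ~500,000 distinct domains; I have not measured whether it helps here:

// Same mapReduce as above, but asking the server to stay in the JS engine;
// jsMode only applies when map emits fewer than ~500,000 distinct keys
res = db.NonFTP_Access_log.mapReduce(m, r, {
    out: { replace: "NonFTP_Access_log_domain_visits" },
    jsMode: true
});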
What am I doing wrong? Are these results normal? Should I use Hadoop?
We ran this test in the following environment:
- Operating system: Suse Linux Enterprise Server 10 (virtual server on Xen)
- RAM: 10 GB
- Cores: 32 (AMD Opteron 6128 processors)
Tags: mongodb, mapreduce, hadoop
Ciges Aug 27 '12 at 9:13