This comes up as a frequently asked MongoDB question, mainly because answering it is a good rough guide to how you should change your mindset coming from SQL, and to what the MongoDB engine actually does.
The basic principle here is "MongoDB does not do joins". Any way of "envisioning" how you would construct the SQL for this essentially requires a "join" operation. The typical form is "UNION", which is in fact a "join".
So how do you do it under a different paradigm? Well, first let's look at how *not* to do it, and understand the reasons why. Even if, of course, it will work for your very small sample:
The Hard Way
db.docs.aggregate([
    { "$group": {
        "_id": null,
        "age": { "$push": "$age" },
        "childs": { "$push": "$childs" }
    }},
    { "$unwind": "$age" },
    { "$group": {
        "_id": "$age",
        "count": { "$sum": 1 },
        "childs": { "$first": "$childs" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$push": {
            "value": "$_id",
            "count": "$count"
        }},
        "childs": { "$first": "$childs" }
    }},
    { "$unwind": "$childs" },
    { "$group": {
        "_id": "$childs",
        "count": { "$sum": 1 },
        "age": { "$first": "$age" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$first": "$age" },
        "childs": { "$push": {
            "value": "$_id",
            "count": "$count"
        }}
    }}
])
That will give you a result like this:
{
    "_id" : null,
    "age" : [
        { "value" : "50", "count" : 1 },
        { "value" : "40", "count" : 3 }
    ],
    "childs" : [
        { "value" : "2", "count" : 3 },
        { "value" : "1", "count" : 1 }
    ]
}
So why is this bad? The main problem should be obvious in the very first pipeline stage:
{ "$group": { "_id": null, "age": { "$push": "$age" }, "childs": { "$push": "$childs" } }},
What we asked for here is to group everything in the collection for the values we wanted and $push those results into arrays. When things are small this works, but real-world collections would result in that "single document" in the pipeline exceeding the 16MB BSON limit. That is what is bad.

The rest of the logic follows the natural course by working with each array. But of course, real-world scenarios almost always make this untenable.
You could avoid this to some degree by doing things like "duplicating" the documents with a "type" of either "age" or "childs", and grouping the documents by that type. But it is all a bit "too convoluted", and not a solid way to do things.
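For illustration, that "duplicate with a type" idea can be sketched roughly as follows. This is a hypothetical sketch, not the answer's own code, and it assumes MongoDB 3.2+ where $project can emit an array of sub-document expressions:

```javascript
// Hypothetical sketch of the "duplicate the documents with a type" approach.
// Each source document is emitted twice, once tagged "age" and once "childs",
// so a single $group can count both fields without first pushing the whole
// collection into one document. Assumes MongoDB 3.2+ ($project array literals).
var pipeline = [
    { "$project": {
        "pairs": [
            { "type": { "$literal": "age" },    "value": "$age" },
            { "type": { "$literal": "childs" }, "value": "$childs" }
        ]
    }},
    { "$unwind": "$pairs" },
    // Group on the tag plus the value, so the counts for both fields
    // come back interleaved in a single result set
    { "$group": {
        "_id": { "type": "$pairs.type", "value": "$pairs.value" },
        "count": { "$sum": 1 }
    }},
    { "$sort": { "_id.type": 1, "_id.value": -1 } }
];
```

It still only reads the collection once, but the client then has to pick the tagged results apart again, which is why it is "too convoluted" compared to what follows.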
The natural response is "what about UNION?", but since MongoDB does not do "joins", how can that be achieved?
The Best Way (aka A New Hope)
Your best approach here, both architecturally and performance-wise, is to simply fire "both" queries (yes, two) "in parallel" at the server via your client API. As the results are received, you then "combine" them into a single response, which you can then send back as a data source to your eventual "client" application.
Different languages have different approaches to this, but the general case is to look for an "asynchronous processing" API that allows you to run the queries in tandem.
My example here uses node.js, because the "asynchronous" side is essentially "built in" and reasonably intuitive to follow. The "combining" side of things can be any hash/map/dict table implementation, just done in a simple way for the example:

var async = require('async'),
    MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost/test',function(err,db) {

    var collection = db.collection('docs');

    async.parallel(
        [
            function(callback) {
                collection.aggregate(
                    [
                        { "$group": {
                            "_id": "$age",
                            "type": { "$first": { "$literal": "age" } },
                            "count": { "$sum": 1 }
                        }},
                        { "$sort": { "_id": -1 } }
                    ],
                    callback
                );
            },
            function(callback) {
                collection.aggregate(
                    [
                        { "$group": {
                            "_id": "$childs",
                            "type": { "$first": { "$literal": "childs" } },
                            "count": { "$sum": 1 }
                        }},
                        { "$sort": { "_id": -1 } }
                    ],
                    callback
                );
            }
        ],
        function(err,results) {
            if (err) throw err;

            var response = {};
            results.forEach(function(res) {
                res.forEach(function(doc) {
                    if ( !response.hasOwnProperty(doc.type) )
                        response[doc.type] = [];
                    response[doc.type].push({
                        "value": doc._id,
                        "count": doc.count
                    });
                });
            });
            console.log( JSON.stringify( response, null, 2 ) );
        }
    );
});
Which gives a nice result:
{
  "age": [
    { "value": "50", "count": 1 },
    { "value": "40", "count": 3 }
  ],
  "childs": [
    { "value": "2", "count": 3 },
    { "value": "1", "count": 1 }
  ]
}
So the main thing to note here is that the "separate" aggregation statements themselves are actually quite simple. The only thing you face is combining them in the final result. There are many approaches to "combining", particularly for dealing with large results from each of the queries, but this is the basic example of the execution model.
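The same execution model can also be sketched with Promises instead of the async library. The `countAges` and `countChilds` functions below are hypothetical stand-ins for the two `collection.aggregate()` calls, resolving with hard-coded sample results so the sketch is self-contained:

```javascript
// Hypothetical stand-ins for the two parallel aggregate() calls,
// each resolving with the document shape its pipeline would return.
function countAges() {
    return Promise.resolve([
        { _id: "50", type: "age", count: 1 },
        { _id: "40", type: "age", count: 3 }
    ]);
}
function countChilds() {
    return Promise.resolve([
        { _id: "2", type: "childs", count: 3 },
        { _id: "1", type: "childs", count: 1 }
    ]);
}

// Fire both "queries" in parallel, then combine into one response object
function combined() {
    return Promise.all([countAges(), countChilds()]).then(function(results) {
        var response = {};
        results.forEach(function(res) {
            res.forEach(function(doc) {
                if (!response.hasOwnProperty(doc.type)) response[doc.type] = [];
                response[doc.type].push({ value: doc._id, count: doc.count });
            });
        });
        return response;
    });
}
```

Swapping the stand-ins for real `collection.aggregate()` calls gives exactly the same structure as the async.parallel example above.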
Key points here:
Mangling the data in the aggregation pipeline like this is possible, but it does not perform for large data sets.
Use a language implementation and API that supports "parallel" and "asynchronous" execution, so that you can "fire off" all or most of your operations at once.
The API should support some method of "combining", or otherwise allow a separate "stream" write to process each result set received into one.
Forget about the SQL way. The NoSQL way delegates the processing of things like "joins" to your "data logic layer", which is what the code shown here is. It does it this way because it is scalable to very large data sets. It is rather the job of your "data logic" handling nodes in large applications to deliver this to the end API.
This is fast compared to any other form of "wrangling" I could possibly describe. Part of "NoSQL" thinking is to "unlearn what you have learned" and look at things a different way. And if that way doesn't perform better, then stick with the SQL approach for storage and query.
That's why alternatives exist.
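The "stream write" key point above can be sketched as a small accumulator that absorbs each result set into one response as it arrives, in any order. The `ResponseBuilder` name and its `absorb` method are hypothetical, not part of any driver API:

```javascript
// Hypothetical accumulator: each result set is folded into a single
// response object as it arrives, in any order, so neither raw result
// set needs to be held around after it has been processed.
function ResponseBuilder() {
    this.response = {};
}
ResponseBuilder.prototype.absorb = function(docs) {
    var response = this.response;
    docs.forEach(function(doc) {
        if (!response.hasOwnProperty(doc.type)) response[doc.type] = [];
        response[doc.type].push({ value: doc._id, count: doc.count });
    });
    return this;
};

// Each absorb() call would typically sit in a separate query's callback;
// sample documents are hard-coded here to keep the sketch self-contained.
var builder = new ResponseBuilder();
builder.absorb([{ _id: "2", type: "childs", count: 3 }]);
builder.absorb([{ _id: "50", type: "age", count: 1 }]);
```

Because `absorb` is order-independent, it works equally well whether the two queries finish first-to-last or last-to-first.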