How to group by multiple fields

I want to find all users named "Hans" and count the occurrences of each "age" and each number of "childs", grouped by value. Assume I have the following users in my database:

```
{ "_id" : "01", "user" : "Hans", "age" : "50", "childs" : "2" }
{ "_id" : "02", "user" : "Hans", "age" : "40", "childs" : "2" }
{ "_id" : "03", "user" : "Fritz", "age" : "40", "childs" : "2" }
{ "_id" : "04", "user" : "Hans", "age" : "40", "childs" : "1" }
```

The result should be something like this:

```
"result" : [
    { "age" : [
        { "value" : "50", "count" : "1" },
        { "value" : "40", "count" : "2" }
    ]},
    { "childs" : [
        { "value" : "2", "count" : "2" },
        { "value" : "1", "count" : "1" }
    ]}
]
```

How can I achieve this?


This is likely to be a frequently asked MongoDB question, mainly because it is a good illustration of how you need to change your thinking from how you would process this in SQL to how the MongoDB engine works.

The basic principle here is that "MongoDB does not do joins". Any way you might envisage constructing this in SQL would essentially require a join operation. The typical form is a "UNION", which is in fact a kind of join.

So how do you do it under a different paradigm? First, let me show how *not* to do it, and explain why. Even though it does, of course, work for your very small sample:

The hard way

```javascript
db.docs.aggregate([
    { "$group": {
        "_id": null,
        "age": { "$push": "$age" },
        "childs": { "$push": "$childs" }
    }},
    { "$unwind": "$age" },
    { "$group": {
        "_id": "$age",
        "count": { "$sum": 1 },
        "childs": { "$first": "$childs" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$push": { "value": "$_id", "count": "$count" } },
        "childs": { "$first": "$childs" }
    }},
    { "$unwind": "$childs" },
    { "$group": {
        "_id": "$childs",
        "count": { "$sum": 1 },
        "age": { "$first": "$age" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$first": "$age" },
        "childs": { "$push": { "value": "$_id", "count": "$count" } }
    }}
])
```

This will give you the result as follows:

```
{
    "_id" : null,
    "age" : [
        { "value" : "50", "count" : 1 },
        { "value" : "40", "count" : 3 }
    ],
    "childs" : [
        { "value" : "2", "count" : 3 },
        { "value" : "1", "count" : 1 }
    ]
}
```

So why is this bad? The main problem is obvious in the very first pipeline stage:

```javascript
{ "$group": {
    "_id": null,
    "age": { "$push": "$age" },
    "childs": { "$push": "$childs" }
}}
```

This asks to group *everything* in the collection and $push the desired values into arrays. When things are small this works, but real-world collections would cause this "single document" in the pipeline to exceed the 16 MB BSON limit. That is what is bad about it.
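To get a feel for when this breaks, here is a rough back-of-the-envelope sketch (the per-element byte cost is an assumed figure purely for illustration, not a measured one):

```javascript
// Back-of-the-envelope sketch: the first $group stage stores every "age"
// and every "childs" value from the collection in ONE document, so its
// size grows linearly with the collection size.
const BSON_LIMIT = 16 * 1024 * 1024;   // hard 16 MB BSON document limit
const bytesPerValue = 10;              // ASSUMED cost per pushed array element
// Each source document contributes one element to each of the two arrays:
const maxDocs = Math.floor(BSON_LIMIT / (2 * bytesPerValue));
console.log(maxDocs); // well under a million source documents and the stage fails
```

In other words, this approach has a hard ceiling measured in hundreds of thousands of documents, regardless of how much hardware you throw at it.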

The rest of the logic follows a natural course, working with each array in turn. But of course real-world scenarios almost always make this untenable.

You could avoid this somewhat by doing things like "duplicating" each document with a "type" of "age" or "childs" and grouping the documents by type. But it is all a bit "over-engineered" and not a solid way to do things.

The natural response is "what about a UNION?" But since MongoDB does not do joins, how do we approach that?


The best way (aka A New Hope)

Your best approach here, both architecturally and for performance, is to simply submit "both" queries (yes, two) "in parallel" to the server via your client API. As the results arrive, you then "combine" them into a single response, which you can then send back as a data source to your eventual "client" application.

Different languages have different approaches to this, but in general you want an "asynchronous processing" API that lets you run the queries in tandem.

My example here uses node.js, since the "asynchronous" side is essentially "built in" and reasonably intuitive to follow. The "combining" side of things can be any hash/map/dict table implementation, kept simple, for example:

```javascript
var async = require('async'),
    MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost/test', function(err, db) {

    var collection = db.collection('docs');

    async.parallel(
        [
            function(callback) {
                collection.aggregate(
                    [
                        { "$group": {
                            "_id": "$age",
                            "type": { "$first": { "$literal": "age" } },
                            "count": { "$sum": 1 }
                        }},
                        { "$sort": { "_id": -1 } }
                    ],
                    callback
                );
            },
            function(callback) {
                collection.aggregate(
                    [
                        { "$group": {
                            "_id": "$childs",
                            "type": { "$first": { "$literal": "childs" } },
                            "count": { "$sum": 1 }
                        }},
                        { "$sort": { "_id": -1 } }
                    ],
                    callback
                );
            }
        ],
        function(err, results) {
            if (err) throw err;

            var response = {};
            results.forEach(function(res) {
                res.forEach(function(doc) {
                    if ( !response.hasOwnProperty(doc.type) )
                        response[doc.type] = [];
                    response[doc.type].push({
                        "value": doc._id,
                        "count": doc.count
                    });
                });
            });
            console.log( JSON.stringify( response, null, 2 ) );
        }
    );
});
```

Which gives a nice result:

```
{
    "age": [
        { "value": "50", "count": 1 },
        { "value": "40", "count": 3 }
    ],
    "childs": [
        { "value": "2", "count": 3 },
        { "value": "1", "count": 1 }
    ]
}
```

So the main thing to note here is that the "separate" aggregation statements themselves are actually quite simple. The only thing left to do is combine them into the final result. There are many approaches to "combining", particularly for dealing with large results from each of the queries, but this is the basic example of the execution model.
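As a driver-agnostic sketch of that "combining" step (illustration only: the input arrays are hard-coded to mirror the `{ _id, type, count }` documents the two pipelines above emit; in a real application they would come from `Promise.all` over the two `aggregate()` calls):

```javascript
// mergeResults: fold several aggregation result sets into one response
// object, keyed by the "type" tag each pipeline attaches to its documents.
function mergeResults(resultSets) {
  const response = {};
  for (const res of resultSets) {
    for (const doc of res) {
      if (!Object.prototype.hasOwnProperty.call(response, doc.type))
        response[doc.type] = [];
      response[doc.type].push({ value: doc._id, count: doc.count });
    }
  }
  return response;
}

// Hard-coded stand-ins for the two parallel query results:
const merged = mergeResults([
  [ { _id: "50", type: "age", count: 1 },
    { _id: "40", type: "age", count: 3 } ],
  [ { _id: "2", type: "childs", count: 3 },
    { _id: "1", type: "childs", count: 1 } ]
]);
console.log(JSON.stringify(merged, null, 2));
```

The merge itself is synchronous and cheap; all the waiting happens in parallel on the server side, which is exactly the point.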


Key points here:

  • Shuffling data around in the aggregation pipeline is possible, but not performant for large data sets.

  • Use a language implementation and API that supports "parallel" and "asynchronous" execution, so you can "fire off" all or most of your operations at once.

  • The API should support some method of "combining", or otherwise allow a separate "stream" write to process each result set received into one.

  • Forget the SQL way. The NoSQL way delegates the processing of things such as "joins" to your "data logic layer", which is what contains the code shown here. It does it this way because it is scalable to very large datasets. It is rather the job of your "data logic" handling nodes in large applications to deliver this to the end API.

This is fast compared with any other form of "wrangling" I could possibly describe. Part of "NoSQL" thinking is to "unlearn what you have learned" and look at things a different way. And if that way doesn't perform better, then stick with the SQL approach for storage and query.

That is why there are alternatives.


This was hard!

First, a bare-bones solution:

```javascript
db.test.aggregate([
    { "$match": { "user": "Hans" } },
    // duplicate each document: one for "age", the other for "childs"
    { $project: {
        age: "$age",
        childs: "$childs",
        data: { $literal: ["age", "childs"] }
    }},
    { $unwind: "$data" },
    // pivot data to something like { data: "age", value: "40" }
    { $project: {
        data: "$data",
        value: { $cond: [ { $eq: ["$data", "age"] }, "$age", "$childs" ] }
    }},
    // group by data type and value, and count
    { $group: {
        _id: { data: "$data", value: "$value" },
        count: { $sum: 1 },
        value: { $first: "$value" }
    }},
    // aggregate the values in an array for each independent (type, value) pair
    { $group: {
        _id: "$_id.data",
        values: { $push: { count: "$count", value: "$value" } }
    }},
    // project each value list to the correctly named field
    { $project: {
        result: { $cond: [ { $eq: ["$_id", "age"] }, { age: "$values" }, { childs: "$values" } ] }
    }},
    // group all data in the result array, and remove the unneeded `_id` field
    { $group: { _id: null, result: { $push: "$result" } } },
    { $project: { _id: 0, result: 1 } }
])
```

Output:

```
{
    "result" : [
        { "age" : [
            { "count" : 3, "value" : "40" },
            { "count" : 1, "value" : "50" }
        ]},
        { "childs" : [
            { "count" : 1, "value" : "1" },
            { "count" : 3, "value" : "2" }
        ]}
    ]
}
```

And now, for some explanation:

One of the main problems is that each incoming document has to be counted in two different sums. My solution is to add a literal array ["age", "childs"] to the documents, and then to unwind them along that array. That way, each document is present twice in the later stages.

After that, to ease processing, I change the representation of the data into something more manageable, such as { data: "age", value: "40" }.

The following steps perform the data aggregation per se, up to the third-to-last step, the $project, which maps the value lists to the corresponding age or childs field.

The last two steps simply wrap the two documents into one, removing the unneeded _id field.
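To make the intermediate shapes concrete, the duplicate-and-pivot trick can be replayed in plain JavaScript (illustration only, not the author's code). Note this runs over all four sample documents, which is what reproduces the counts of 3 shown in the output above:

```javascript
// The sample collection from the question:
const docs = [
  { _id: "01", user: "Hans", age: "50", childs: "2" },
  { _id: "02", user: "Hans", age: "40", childs: "2" },
  { _id: "03", user: "Fritz", age: "40", childs: "2" },
  { _id: "04", user: "Hans", age: "40", childs: "1" }
];

// $project + $unwind: each document appears once per field of interest,
// pivoted to a { data, value } pair.
const pivoted = docs.flatMap(d => [
  { data: "age", value: d.age },
  { data: "childs", value: d.childs }
]);

// $group on (data, value): count the occurrences of each pair.
const counts = {};
for (const p of pivoted) {
  const key = p.data + ":" + p.value;
  counts[key] = (counts[key] || 0) + 1;
}
console.log(counts); // age:50 -> 1, age:40 -> 3, childs:2 -> 3, childs:1 -> 1
```

The remaining pipeline stages are just reshaping: pushing each group of pairs into a per-field array and wrapping both arrays in one result document.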

Pffff!

