MongoDB: count unique subdocuments in array

I am implementing a small application using MongoDB as a backend. In this application, I have a data structure where documents contain a field holding an array of subdocuments.

As a basis, I use the following use case: http://docs.mongodb.org/manual/use-cases/inventory-management/

As you can see from the example, each document has a field called carted, which is an array of subdocuments.

 {
     _id: 42,
     last_modified: ISODate("2012-03-09T20:55:36Z"),
     status: 'active',
     items: [
         { sku: '00e8da9b', qty: 1, item_details: {...} },
         { sku: '0ab42f88', qty: 4, item_details: {...} }
     ]
 }

This fits me perfectly, except for one problem: I want to count every unique element (with "sku" as the unique identifier) across the entire collection, where each document adds at most 1 to the count (multiple instances of the same "sku" in the same document are still counted as 1). For instance, I would like to get this result:

{sku: '00e8da9b', doc_count: 1}, {sku: '0ab42f88', doc_count: 9}

After reading up on MongoDB, I was pretty confused about how to do this (quickly) with a complex schema like the one described above. If I understood the otherwise excellent documentation correctly, such an operation can be achieved using either the aggregation framework or map/reduce, but here I need some input:

  • Which framework is better suited to achieve the result I'm looking for, given the complexity of the structure?
  • Which indexes would be preferable to get the best performance out of the chosen framework?
+4
2 answers

MapReduce is slow, but can handle very large data sets. The aggregation framework, on the other hand, is a little quicker, but will struggle with large data volumes.

The trouble with the structure shown is that you need to unwind the arrays to get at the data. Unwinding creates a new document for every array element, and with the aggregation framework that has to happen in memory. So if you have 1000 documents with 100 array elements each, it will need to build a stream of 100,000 documents in order to $group them and count them.
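To make that cost concrete, here is a plain-JavaScript sketch (runnable in Node, not mongo shell; the `unwind` helper is an illustrative stand-in, not MongoDB's implementation) of what $unwind does: it emits one copy of the parent document per array element.

```javascript
// A toy $unwind: emit one copy of the parent document per array element,
// with the array field replaced by that single element.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    for (const elem of doc[field]) {
      out.push({ ...doc, [field]: elem });
    }
  }
  return out;
}

// 1000 documents with 100 array elements each...
const docs = Array.from({ length: 1000 }, (_, i) => ({
  _id: i,
  items: Array.from({ length: 100 }, (_, j) => ({ sku: 'sku' + j })),
}));

// ...become a stream of 100,000 documents flowing into $group.
console.log(unwind(docs, 'items').length); // 100000
```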

You might want to consider whether there is a schema that would serve your queries better, but if you want to do it with the aggregation framework, here's how you could do it (with some sample data, so the whole script will drop into a shell):

 db.so.remove();
 db.so.ensureIndex({ "items.sku": 1 }, { unique: false });
 db.so.insert([
     {
         _id: 42,
         last_modified: ISODate("2012-03-09T20:55:36Z"),
         status: 'active',
         items: [
             { sku: '00e8da9b', qty: 1, item_details: {} },
             { sku: '0ab42f88', qty: 4, item_details: {} },
             { sku: '0ab42f88', qty: 4, item_details: {} },
             { sku: '0ab42f88', qty: 4, item_details: {} },
         ]
     },
     {
         _id: 43,
         last_modified: ISODate("2012-03-09T20:55:36Z"),
         status: 'active',
         items: [
             { sku: '00e8da9b', qty: 1, item_details: {} },
             { sku: '0ab42f88', qty: 4, item_details: {} },
         ]
     },
 ]);

 db.so.runCommand("aggregate", {
     pipeline: [
         {
             // optional filter to exclude inactive documents - can be removed
             // you'll want an index on this field if you use it too
             $match: { status: "active" }
         },
         // unwind creates a doc for every array element
         { $unwind: "$items" },
         {
             $group: {
                 // group by unique doc/SKU pair, because you only wanted
                 // to count a SKU once per document
                 _id: { _id: "$_id", sku: "$items.sku" },
             }
         },
         {
             $group: {
                 // then group by unique SKU and count them
                 _id: { sku: "$_id.sku" },
                 doc_count: { $sum: 1 },
             }
         }
     ]
     //, explain: true
 })

Note that I've $group'd twice: because you said that a SKU should only count once per document, we need to first reduce the stream to unique doc/sku pairs and then count those.

If you want the output shaped slightly differently (in other words, EXACTLY like in your sample), we can $project them.
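The answer doesn't show that stage, but a hypothetical final $project (field names assumed from the question's desired output) might look like the `projectStage` object below. The second half is plain JavaScript simulating what that stage does to the $group output from the sample data above.

```javascript
// Hypothetical extra stage, appended after the second $group: reshape
// { _id: { sku }, doc_count } into the { sku, doc_count } form from the question.
const projectStage = {
  $project: {
    _id: 0,            // drop the compound _id produced by $group
    sku: "$_id.sku",   // surface the SKU as a top-level field
    doc_count: 1       // keep the count as-is
  }
};

// What that stage does, simulated on the $group output for the sample data
// (both sample docs contain both SKUs, so each doc_count is 2):
const grouped = [
  { _id: { sku: '0ab42f88' }, doc_count: 2 },
  { _id: { sku: '00e8da9b' }, doc_count: 2 },
];
const projected = grouped.map(d => ({ sku: d._id.sku, doc_count: d.doc_count }));
console.log(projected);
// [ { sku: '0ab42f88', doc_count: 2 }, { sku: '00e8da9b', doc_count: 2 } ]
```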

+14

With the latest mongo build (this may be true for other builds as well), I found that a slightly different version of Cirrus' answer is faster and consumes less memory. I don't know the details of why, but it seems that with this version mongo somehow has more room to optimize the pipeline.

 db.so.runCommand("aggregate", {
     pipeline: [
         { $unwind: "$items" },
         {
             $group: {
                 // build the set of unique SKUs per document id
                 _id: { id: "$_id" },
                 sku: { $addToSet: "$items.sku" }
             }
         },
         // unroll all the sets
         { $unwind: "$sku" },
         {
             $group: {
                 // then count the unique values per each id
                 _id: { id: "$_id.id", sku: "$sku" },
                 count: { $sum: 1 },
             }
         }
     ]
 })

To match exactly the format the question asked for, the "_id" part must be omitted from the final $group's grouping key, so that documents are counted per SKU alone.
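For illustration, here is a plain-JavaScript sketch (runnable in Node, outside mongo) of what that modified pipeline computes: $addToSet deduplicates SKUs within each document, and the final $group keyed on the SKU alone then counts one per document. The sample data mirrors the shell script in the first answer.

```javascript
// Sample documents mirroring the ones inserted in the first answer.
const docs = [
  { _id: 42, items: [{ sku: '00e8da9b' }, { sku: '0ab42f88' },
                     { sku: '0ab42f88' }, { sku: '0ab42f88' }] },
  { _id: 43, items: [{ sku: '00e8da9b' }, { sku: '0ab42f88' }] },
];

// $unwind + $group with $addToSet: the set of unique SKUs per document.
const skuSets = docs.map(d => new Set(d.items.map(i => i.sku)));

// Second $unwind + final $group keyed on the SKU alone ("_id" omitted):
// each document contributes at most 1 per SKU.
const counts = {};
for (const set of skuSets) {
  for (const sku of set) counts[sku] = (counts[sku] || 0) + 1;
}

console.log(counts); // { '00e8da9b': 2, '0ab42f88': 2 }
```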

+2

Source: https://habr.com/ru/post/1442127/

