MongoDB aggregation: $lookup document size and speed

I am trying to use a MongoDB aggregation query to join ($lookup) two collections and then count the distinct values for each key across the combined metaDataMap maps. Note: I do not necessarily know which fields (keys) exist in metaDataMap, and I do not want to read or hard-code fields that may or may not be in the map. That is why the aggregation query looks the way it does.

So, my two collections look like this. The events collection:

{ "_id" : "1", "name" : "event1", "objectsIds" : [ "1", "2", "3" ], } 

The objects collection:

 { "_id" : "1", "name" : "object1", "metaDataMap" : { "SOURCE" : ["ABC", "DEF"], "DESTINATION" : ["XYZ", "PDQ"], "TYPE" : [] } }, { "_id" : "2", "name" : "object2", "metaDataMap" : { "SOURCE" : ["RST", "LNE"], "TYPE" : ["text"] } }, { "_id" : "3", "name" : "object3", "metaDataMap" : { "SOURCE" : ["NOP"], "DESTINATION" : ["PHI", "NYC"], "TYPE" : ["video"] } } 

My desired results:

 { _id:"SOURCE", count:5 _id:"DESTINATION", count: 4 _id:"TYPE", count: 2 } 

What I have so far:

    db.events.aggregate([
      {$match: {"_id" : id}}
      ,{$lookup: {"from" : "objects", "localField" : "objectsIds", "foreignField" : "_id", "as" : "objectResults"}}
      ,{$unwind: "$objectResults"}                    //Line 1
      ,{$project: {x: "$objectResults.metaDataMap"}}  //Line 2
      ,{$unwind: "$x"}
      ,{$project: {"_id": 0}}
      ,{$project: {x: {$objectToArray: "$x"}}}
      ,{$unwind: "$x"}
      ,{$group: {_id: "$x.k", tmp: {$push: "$x.v"}}}
      ,{$addFields: {tmp: {$reduce: {
          input: "$tmp",
          initialValue: [],
          in: {$concatArrays: ["$$value", "$$this"]}
      }}}}
      ,{$unwind: "$tmp"}
      ,{$group: {_id: "$_id", uniqueVals: {$addToSet: "$tmp"}}}
      ,{$addFields: {count: {"$size": "$uniqueVals"}}}
      ,{$project: {_id: "$_id", count: "$count"}}
    ]);
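For context on the stages marked Line 1 and Line 2 (discussed below): after the $lookup, each matched event carries the entire array of joined objects inside a single document. With the sample data above, the document coming out of $lookup, before Line 1's $unwind, should look like this:

    {
      "_id" : "1",
      "name" : "event1",
      "objectsIds" : [ "1", "2", "3" ],
      "objectResults" : [
        { "_id" : "1", "name" : "object1", "metaDataMap" : { "SOURCE" : ["ABC", "DEF"], "DESTINATION" : ["XYZ", "PDQ"], "TYPE" : [] } },
        { "_id" : "2", "name" : "object2", "metaDataMap" : { "SOURCE" : ["RST", "LNE"], "TYPE" : ["text"] } },
        { "_id" : "3", "name" : "object3", "metaDataMap" : { "SOURCE" : ["NOP"], "DESTINATION" : ["PHI", "NYC"], "TYPE" : ["video"] } }
      ]
    }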

My problem is with the stages marked Line 1 and Line 2. The pipeline above works, but takes about 50 seconds when there are 25,000 values in the metaDataMap array fields (objectResults.metaDataMap), for example 25,000 values in the SOURCE array of object 1's metaDataMap. That is far too slow. A quicker way is to replace Line 1 and Line 2 with:

    ,{$project: {x: "$objectResults.metaDataMap"}}  //Line 1
    ,{$unwind: "$x"}                                //Line 2

This is faster (under 3 seconds), but only works on datasets with roughly 10,000 elements or fewer. Anything above that, and I get the error "exceeds maximum document size", presumably because without the first $unwind the entire joined objectResults array stays inside a single document, which runs into MongoDB's 16 MB BSON document limit.

Please help!

1 answer

If you can change the schema of the objects collection to include a parent_id field, you can drop the first four stages of your pipeline (the initial $match, $lookup, $unwind and $project). This eliminates the Line 1 and Line 2 problems entirely.

For example, a document in the objects collection would look like this:

 { "_id": "1", "name": "object1", "metaDataMap": { "SOURCE": [ "ABC", "DEF" ], "DESTINATION": [ "XYZ", "PDQ" ], "TYPE": [ ] }, "parent_id": "1" } 

This way you do not need the expensive $lookup and $unwind at all. The first four stages can be replaced by:

 {$match: {parent_id: id}} 
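One addition the answer does not mention: for that $match to stay fast as the collection grows, parent_id should be indexed:

    // Assumes the backfilled schema above; a plain ascending index is enough
    db.objects.createIndex({ parent_id: 1 })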

Building on this idea, I optimized the rest of the pipeline as well, resulting in:

    db.objects.aggregate([
      {$match: {parent_id: id}}
      ,{$project: {metaDataMap: {$filter: {
          input: {$objectToArray: '$metaDataMap'},
          cond: {$ne: [[], '$$this.v']}
      }}}}
      ,{$unwind: '$metaDataMap'}
      ,{$unwind: '$metaDataMap.v'}
      ,{$group: {_id: '$metaDataMap.k', val: {$addToSet: '$metaDataMap.v'}}}
      ,{$project: {count: {$size: '$val'}}}
    ])
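To make the first two stages concrete: $objectToArray turns the map into an array of {k, v} documents, and the $filter drops keys whose value array is empty. For object1's metaDataMap this yields (my illustration, not output from the answer):

    // object1.metaDataMap before:
    //   { "SOURCE": ["ABC","DEF"], "DESTINATION": ["XYZ","PDQ"], "TYPE": [] }
    // after $objectToArray + $filter (TYPE removed because its value is []):
    [
      { "k" : "SOURCE",      "v" : [ "ABC", "DEF" ] },
      { "k" : "DESTINATION", "v" : [ "XYZ", "PDQ" ] }
    ]

The two $unwind stages then emit one document per (key, value) pair, so a single $group with $addToSet deduplicates values per key directly, replacing the $push, $reduce/$concatArrays, $unwind and second $group chain from the original pipeline.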

Run against the sample data above, the full pipeline outputs:

 { "_id": "TYPE", "count": 2 } { "_id": "DESTINATION", "count": 4 } { "_id": "SOURCE", "count": 5 } 

Source: https://habr.com/ru/post/1272200/

