Getting the differences between two differently structured collections

Question


Suppose I have two collections: A and B

A contains simple documents of the following form:

 { _id: '...', value: 'A', data: '...' }
 { _id: '...', value: 'B', data: '...' }
 { _id: '...', value: 'C', data: '...' }
 …

B contains nested objects like this:

 { _id: '...', values: [ 'A', 'B' ] }
 { _id: '...', values: [ 'C' ] }
 …

It may happen that A contains documents that are not referenced by any document in B, or that B references documents that do not exist in A.

I'll call both kinds "orphaned".

Now my question is: how can I find those orphaned documents in the most efficient way? In the end, I need their _id fields.

So far, I have used unwind to flatten A and calculated the difference using the differenceWith function from Ramda, but this takes a long time and is certainly not very efficient, since I do all the work on the client and not in the database.

I saw that MongoDB has a $setDifference operator, but I did not get it to work.

Can someone point me in the right direction on how to solve this using Node.js, running most (all?) of the work in the database? Any hints are welcome :-)


node.js mongodb aggregation-framework

Golo roden Jun 25 '15 at 9:51


1 answer

cessor · Accepted Answer · 2015-06-25T10:46:47+0000

In MongoDB, you can use the aggregation pipeline for what you are trying to do. If that does not help, you can fall back to MapReduce, but it is a bit more complicated.

In this example, I named the two collections "tags" and "papers": "tags" corresponds to your B and "papers" to your A.

First, we get the set of values that are actually referenced. To do this, we flatten every values array in the tags collection and collect the results. $unwind creates one document, carrying the original _id, for each element of the "values" array. $group then pushes this flat list into a single array and discards the original identifiers.

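  // aggregate() takes an array of stages and returns a cursor; next() fetches the single result document
  var referenced_tags = db.tags.aggregate([ {$unwind: '$values'}, {$group: { _id: '', tags: { $push: '$values'} } } ]).next();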

This returns:

 { "_id" : "", "tags" : [ "A", "B", "C"] }

This list contains every value referenced by any document in tags.
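To make the two stages concrete, here is a plain JavaScript sketch of what $unwind and $group with $push do to the data. It runs without a database. The sample documents and their numeric _id values are made up for illustration and only mirror the shape of your collection B.

```javascript
// Plain-JavaScript illustration of $unwind followed by $group/$push.
// The sample documents are hypothetical; they mirror the shape of collection B.
const tagDocs = [
  { _id: 1, values: ['A', 'B'] },
  { _id: 2, values: ['C'] }
];

// $unwind: one output document per array element, keeping the original _id.
const unwound = tagDocs.flatMap(doc =>
  doc.values.map(value => ({ _id: doc._id, values: value }))
);

// $group with a constant _id and $push: collect every value into one array.
const grouped = { _id: '', tags: unwound.map(doc => doc.values) };

console.log(JSON.stringify(grouped));
```

If the same value can appear in several documents, $addToSet instead of $push would avoid duplicates in the resulting array.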

Then you build a similar list containing the values of the documents that actually exist. This does not require the $unwind step, since value is a scalar (not a list):

 // same pattern as above: array of stages, then next() for the single result document
 var papers = db.papers.aggregate([ {$group: { _id: '', tags: {$push: '$value'} } } ]).next();

which yields:

 { "_id" : "", "tags" : [ "A", "B", "C", "D"] }

As you can see from the data I put into the database, the papers collection (your A) contains a document with the value "D", which is not referenced anywhere in the tags collection and is therefore an orphan.

Now you can calculate the difference any way you like. The following may be slow, but it works as an example:

 var a = papers.tags;           // values that exist in papers
 var b = referenced_tags.tags;  // values referenced from tags
 var delta = a.filter(function (v) { return b.indexOf(v) < 0; });  // existing but never referenced
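The indexOf scan above is quadratic in the number of values. If you do compute the difference on the client, a Set keeps it linear and makes it easy to check both directions. This is a minimal sketch, assuming the values are primitives such as strings; difference is a hypothetical helper name (not a MongoDB or Ramda function), and the value 'E' is made up to show the second direction.

```javascript
// Values present in `left` but missing from `right`.
// `difference` is a hypothetical helper name, not a MongoDB or Ramda function.
function difference(left, right) {
  const seen = new Set(right); // constant-time membership checks
  return left.filter(v => !seen.has(v));
}

const existing = ['A', 'B', 'C', 'D'];   // values found in papers (your A)
const referenced = ['A', 'B', 'C', 'E']; // values referenced from tags (your B); 'E' is made up

// Documents in papers that no tag references.
const unreferenced = difference(existing, referenced);
// References in tags that point at no existing paper.
const dangling = difference(referenced, existing);

console.log(unreferenced, dangling);
```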

Next, you can find the orphaned documents by querying for the values in delta and projecting only their _id:

 db.papers.find({'value' : {'$in': delta}}, {'_id': 1})

This returns:

 { "_id" : ObjectId("558bd2...44f6a") }

EDIT: Although this shows nicely how to approach the problem with the aggregation framework, it is not a practical solution. You don't even need aggregation, since MongoDB is pretty smart:

 db.papers.find({'value' : {'$nin': tags.values }}, {'_id': 1})

where tags is:

 // note: this reads only the first document in tags
 var cursor = db.tags.find(); var tags = cursor.hasNext() ? cursor.next() : null;

As pointed out by @karthick.k.
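Note that the query above compares against the values array of a single tags document. To cover every tags document and keep all of the work in the database, one option is a $lookup stage, which is available from MongoDB 3.2 and therefore newer than this answer. The following is a minimal sketch, assuming the collection and field names used above; orphanedPapersPipeline is a hypothetical helper name, and I have not benchmarked it.

```javascript
// Builds an aggregation pipeline that returns the _id of every paper whose
// value is not contained in the values array of any document in tags.
// `orphanedPapersPipeline` is a hypothetical helper; the collection and field
// names are the ones assumed in this answer.
function orphanedPapersPipeline(tagsCollection = 'tags') {
  return [
    // Attach every tags document whose values array contains this paper's value.
    {
      $lookup: {
        from: tagsCollection,
        localField: 'value',
        foreignField: 'values',
        as: 'refs'
      }
    },
    // Keep only the papers that nothing refers to.
    { $match: { refs: { $size: 0 } } },
    // Return just the identifier.
    { $project: { _id: 1 } }
  ];
}

const pipeline = orphanedPapersPipeline();
console.log(JSON.stringify(pipeline, null, 2));
```

In the shell this would be run as db.papers.aggregate(orphanedPapersPipeline()), and with the Node.js driver as db.collection('papers').aggregate(orphanedPapersPipeline()).toArray(). An index on values in the tags collection should help the lookup.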
