How to remove duplicate objects from a MongoDB array?

My data is as follows:

    {
        "foo_list": [
            { "id": "98aa4987-d812-4aba-ac20-92d1079f87b2", "name": "Foo 1", "slug": "foo-1" },
            { "id": "98aa4987-d812-4aba-ac20-92d1079f87b2", "name": "Foo 1", "slug": "foo-1" },
            { "id": "157569ec-abab-4bfb-b732-55e9c8f4a57d", "name": "Foo 3", "slug": "foo-3" }
        ]
    }

Here foo_list is a field in a model called Bar. Note that the first and second objects in the array are complete duplicates.

Besides the obvious decision to switch to PostgreSQL, which MongoDB query can I run to remove duplicate entries from foo_list?

Similar answers that don't quite cut it:

Those questions cover arrays of simple strings. In my situation, however, the array is filled with objects.

I hope it is clear that I am not interested in querying the database; I want duplicates to disappear from the database forever.

1 answer

Purely from the point of view of aggregation, there are several approaches to this.

You can either apply $setUnion in modern versions (MongoDB 2.6 and higher):

    db.collection.aggregate([
        { "$project": {
            "foo_list": { "$setUnion": [ "$foo_list", "$foo_list" ] }
        }}
    ])

Or more traditionally with $unwind and $addToSet:

    db.collection.aggregate([
        { "$unwind": "$foo_list" },
        { "$group": {
            "_id": "$_id",
            "foo_list": { "$addToSet": "$foo_list" }
        }}
    ])
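For intuition about what both pipelines compute, here is a minimal plain-JavaScript sketch (runnable in Node.js, not a MongoDB API) that deduplicates an array of objects by value. The helper name dedupeByValue is hypothetical, and using JSON.stringify as the equality key is an assumption that only works when duplicates list their fields in the same order. Unlike $setUnion and $addToSet, which make no promise about the order of the output array, this sketch keeps the first occurrence of each value in place.

```javascript
// Hypothetical helper: keep the first occurrence of each distinct object.
function dedupeByValue(list) {
  const seen = new Set();
  return list.filter(function (item) {
    // Assumption: equal objects serialize identically (same field order).
    const key = JSON.stringify(item);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Sample data from the question.
const fooList = [
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "157569ec-abab-4bfb-b732-55e9c8f4a57d", name: "Foo 3", slug: "foo-3" }
];

const unique = dedupeByValue(fooList);
console.log(unique.length); // 2
```

Note that the aggregation statements above only return deduplicated results; they do not change the stored documents.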

Or, if you are only interested in which entries are duplicated, a more general grouping:

    db.collection.aggregate([
        { "$unwind": "$foo_list" },
        { "$group": {
            "_id": { "_id": "$_id", "foo_list": "$foo_list" },
            "count": { "$sum": 1 }
        }},
        { "$match": { "count": { "$ne": 1 } } },
        { "$group": {
            "_id": "$_id._id",
            "foo_list": { "$push": "$_id.foo_list" }
        }}
    ])

The latter form is useful if you really want to "remove" the duplicates from your data with a separate update statement, since it identifies which elements are duplicated.

So, in this last form, the result returned for your sample data identifies the duplicate:

    {
        "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
        "foo_list" : [
            { "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2", "name" : "Foo 1", "slug" : "foo-1" }
        ]
    }
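The counting logic behind that grouping can be sketched in plain JavaScript for a single document's array. This is only an illustration, not MongoDB code: the helper name findDuplicates is hypothetical, and JSON.stringify stands in for value equality under the assumption that duplicates have identical field order.

```javascript
// Hypothetical helper: return one copy of every value that occurs
// more than once, mirroring $group with a count and $match count != 1.
function findDuplicates(list) {
  const counts = new Map();
  list.forEach(function (item) {
    const key = JSON.stringify(item); // assumed equality key
    const entry = counts.get(key) || { item: item, count: 0 };
    entry.count += 1;
    counts.set(key, entry);
  });
  return Array.from(counts.values())
    .filter(function (entry) { return entry.count !== 1; })
    .map(function (entry) { return entry.item; });
}

// Sample data from the question.
const sampleList = [
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" },
  { id: "157569ec-abab-4bfb-b732-55e9c8f4a57d", name: "Foo 3", slug: "foo-3" }
];

const duplicates = findDuplicates(sampleList);
console.log(duplicates.length);  // 1
console.log(duplicates[0].slug); // foo-1
```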

The results contain one document for each document in your collection that has duplicate entries in the array, along with which entries are duplicated. This is the information you need for updating, so you loop over the results and build each update from them in order to remove the duplicates.

In fact, this takes two update statements per document, since a simple $pull would remove both matching elements, which is not what you want:

    // Find, per document, the array elements that occur more than once.
    var cursor = db.collection.aggregate([
        { "$unwind": "$foo_list" },
        { "$group": {
            "_id": { "_id": "$_id", "foo_list": "$foo_list" },
            "count": { "$sum": 1 }
        }},
        { "$match": { "count": { "$ne": 1 } } },
        { "$group": {
            "_id": "$_id._id",
            "foo_list": { "$push": "$_id.foo_list" }
        }}
    ]);

    var batch = db.collection.initializeOrderedBulkOp();
    var count = 0;

    cursor.forEach(function(doc) {
        doc.foo_list.forEach(function(dup) {
            // Set the first matching element to null...
            batch.find({ "_id": doc._id, "foo_list": { "$elemMatch": dup } }).updateOne({
                "$unset": { "foo_list.$": "" }
            });
            // ...then remove the null entry from the array.
            batch.find({ "_id": doc._id }).updateOne({
                "$pull": { "foo_list": null }
            });
        });

        count++;
        // Send the queued updates every 500 documents and start a new batch.
        if ( count % 500 == 0 ) {
            batch.execute();
            batch = db.collection.initializeOrderedBulkOp();
        }
    });

    // Send whatever is left in the final batch.
    if ( count % 500 != 0 )
        batch.execute();

This is the modern way of doing it, for MongoDB 2.6 and higher, using a cursor over the aggregation results and Bulk operations for the updates. But the principles remain the same:

  • Identify the duplicates in documents

  • Loop over the results to issue the updates to the affected documents

  • Use $unset with the positional $ operator to set the "first" matched array element to null

  • Use $pull to remove the null entry from the array
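To see why the last two steps leave one copy behind where a direct $pull would not, here is a small plain-JavaScript model of both behaviours on an in-memory array. It is a sketch under stated assumptions, not MongoDB code: the function names are hypothetical, and JSON.stringify is assumed to be an adequate stand-in for matching an element by value.

```javascript
// Assumed equality check: same fields, same values, same field order.
function sameValue(a, b) {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Models a direct $pull of the duplicated value: every match is removed.
function pullAllMatching(list, dup) {
  return list.filter(function (item) { return !sameValue(item, dup); });
}

// Models the two-step update: null out the first match, then drop nulls.
function unsetFirstThenPullNull(list, dup) {
  const copy = list.slice();
  const index = copy.findIndex(function (item) { return sameValue(item, dup); });
  if (index !== -1) copy[index] = null; // like $unset on "foo_list.$"
  return copy.filter(function (item) { return item !== null; }); // like $pull of null
}

const foo1 = { id: "98aa4987-d812-4aba-ac20-92d1079f87b2", name: "Foo 1", slug: "foo-1" };
const foo3 = { id: "157569ec-abab-4bfb-b732-55e9c8f4a57d", name: "Foo 3", slug: "foo-3" };
const entries = [foo1, Object.assign({}, foo1), foo3];

const afterPull = pullAllMatching(entries, foo1);
const afterTwoStep = unsetFirstThenPullNull(entries, foo1);
console.log(afterPull.length);    // 1
console.log(afterTwoStep.length); // 2
```

In this model a single pass removes only one copy of each duplicated value, so an element that appears three or more times would still be duplicated afterwards and the pass would have to be repeated.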

So, after processing the above operations, your sample now looks like this:

    {
        "_id" : ObjectId("53f5f7314ffa9b02cf01c076"),
        "foo_list" : [
            { "id" : "98aa4987-d812-4aba-ac20-92d1079f87b2", "name" : "Foo 1", "slug" : "foo-1" },
            { "id" : "157569ec-abab-4bfb-b732-55e9c8f4a57d", "name" : "Foo 3", "slug" : "foo-3" }
        ]
    }

The duplicate is removed while one copy of the "duplicated" item is kept. This is how you identify and remove duplicate data from your collection.


Source: https://habr.com/ru/post/1200798/