The fastest way to remove duplicate documents in MongoDB

I have about 1.7M documents in MongoDB (10M+ in the future). Some of them are duplicate entries that I do not want. The structure of a document looks something like this:

{ _id: 14124412, nodes: [ 12345, 54321 ], name: "Some beauty" } 

A document is a duplicate if it has at least one node in common with another document that has the same name. What is the fastest way to remove the duplicates?

+32
performance optimization duplicates mongodb
Jan 6 '13 at 16:23
8 answers

Assuming you want to permanently delete documents that contain a duplicate name + nodes entry from the collection, you can add a unique index with dropDups: true:

 db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true}) 

As the docs say, use extreme caution with this, as it will delete data from your database. Back up your database first in case it doesn’t do what you expect.

UPDATE

This solution only works up to MongoDB 2.x, since the dropDups option is no longer available in version 3.0 (docs).
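On MongoDB 3.0+, a minimal sketch of the replacement workflow (assuming the collection and fields from the question): remove the duplicates first, for example with one of the aggregation approaches in the answers below, and only then create the unique index without dropDups, since creating it while duplicates still exist fails with a duplicate key error:

 db.test.createIndex({name: 1, nodes: 1}, {unique: true})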

+42
Jan 06 '13 at 17:00

The dropDups: true option is not available in version 3.0.

I have a solution based on the aggregation framework that collects the duplicates and then removes them.

This may be slightly slower than the system-level index change, but it is good because it lets you see which duplicate documents you are removing.

a. Delete all duplicates in one go

 var duplicates = [];

 db.collectionName.aggregate([
   { $match: {
     name: { "$ne": '' }              // discard selection criteria
   }},
   { $group: {
     _id: { name: "$name" },          // can be grouped on multiple properties
     dups: { "$addToSet": "$_id" },
     count: { "$sum": 1 }
   }},
   { $match: {
     count: { "$gt": 1 }              // duplicates are groups with a count greater than one
   }}
 ],
 { allowDiskUse: true }               // for faster processing if the set is large
 )                                    // you can print the result up to here and inspect the duplicates
 .forEach(function(doc) {
   doc.dups.shift();                  // skip the first element, so one document is kept
   doc.dups.forEach(function(dupId) {
     duplicates.push(dupId);          // collect all duplicate ids
   });
 });

 // If you want to check all the "_id" values you are deleting; otherwise the print is not needed
 printjson(duplicates);

 // Remove all duplicates in one go
 db.collectionName.remove({ _id: { $in: duplicates } });

b. Delete duplicate documents one group at a time

 db.collectionName.aggregate([
   // discard selection criteria; you can remove the "$match" stage if you want
   { $match: {
     "source_references.key": { "$ne": '' }
   }},
   { $group: {
     _id: { key: "$source_references.key" },  // can be grouped on multiple properties
     dups: { "$addToSet": "$_id" },
     count: { "$sum": 1 }
   }},
   { $match: {
     count: { "$gt": 1 }                      // duplicates are groups with a count greater than one
   }}
 ],
 { allowDiskUse: true }                       // for faster processing if the set is large
 )                                            // you can print the result up to here and inspect the duplicates
 .forEach(function(doc) {
   doc.dups.shift();                          // skip the first element, so one document is kept
   db.collectionName.remove({ _id: { $in: doc.dups } });  // delete the remaining duplicates
 });
+61
Oct 27 '15 at 9:38

  • Dump the collection using mongodump
  • Clear the collection
  • Add a unique index
  • Restore the collection using mongorestore (a sketch of these steps follows below)
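A minimal sketch of these steps, assuming the collection from the question is named test in a database named mydb (both names are assumptions); steps 1 and 4 run outside the shell, the rest inside it:

 // Step 1 (outside the shell):  mongodump --db mydb --collection test --out /backup
 db.test.drop()                                             // Step 2: clear the collection
 db.test.createIndex({name: 1, nodes: 1}, {unique: true})   // Step 3: add the unique index
 // Step 4 (outside the shell):  mongorestore --db mydb --collection test /backup/mydb/test.bson
 // mongorestore keeps inserting after duplicate key errors by default,
 // so documents that violate the unique index should simply be skipped.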

+22
Jul 01 '16 at 6:42

I found this solution, which works with MongoDB 3.4. I assume the duplicated field is called fieldX:

 db.collection.aggregate([
   {
     // only match documents that have this field;
     // you can omit this stage if you don't have documents missing fieldX
     $match: { "fieldX": { $nin: [null] } }
   },
   {
     $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } }
   },
   {
     $replaceRoot: { "newRoot": "$doc" }
   }
 ], { allowDiskUse: true })

As a newcomer to MongoDB, I spent a lot of time on other, longer solutions for finding and removing duplicates. However, I think this solution is neat and easy to understand.

It works by first matching only documents that contain fieldX (I had several documents without this field, and I got one extra empty result).

The next stage groups the documents by fieldX and keeps only the $first document of each group, captured with $$ROOT. Finally, $replaceRoot replaces each aggregated group with the document found via $first and $$ROOT.

I had to add allowDiskUse because my collection is large.

You can add this after any number of pipeline stages, and although the documentation for $first mentions a sorting stage before using $first, it worked for me without it. (I can’t post the link here, my reputation is less than 10.)

You can save the results into a new collection by adding a $out stage, as sketched below.
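A minimal sketch, assuming the deduplicated documents should go into a new collection named collection_dedup (the collection name, and the $sort stage that makes $first deterministic, are assumptions):

 db.collection.aggregate([
   { $match: { "fieldX": { $nin: [null] } } },
   { $sort: { _id: 1 } },                                            // optional: makes $first deterministic
   { $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } } },
   { $replaceRoot: { "newRoot": "$doc" } },
   { $out: "collection_dedup" }                                      // write the result to a new collection
 ], { allowDiskUse: true })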

Alternatively, if you are interested in only a few fields, for example field1 and field2, rather than the whole document, you can select them in the group stage and omit $replaceRoot:

 db.collection.aggregate([
   {
     // only match documents that have this field
     $match: { "fieldX": { $nin: [null] } }
   },
   {
     $group: {
       "_id": "$fieldX",
       "field1": { "$first": "$$ROOT.field1" },
       "field2": { "$first": "$field2" }
     }
   }
 ], { allowDiskUse: true })
+7
Jun 13 '17 at 13:13

You can do something like this if you are using pymongo.

 def _run_query():
     try:
         for record in aggregate_based_on_field(collection):
             if not record:
                 continue
             _logger.info("Working on record %s", record)
             try:
                 # keep one document per duplicate group
                 retain = db.collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                 _logger.info("_id to retain from duplicates %s", retain['_id'])
                 # remove every other document with the same field values
                 db.collection.remove({'field1': 'x', 'field2': 'y',
                                       '_id': {'$ne': retain['_id']}})
             except Exception as ex:
                 _logger.error("Error when retaining the record %s, exception: %s", record, str(ex))
     except Exception as e:
         _logger.error("Mongo error when deleting duplicates %s", str(e))


 def aggregate_based_on_field(collection):
     return collection.aggregate([{'$group': {'_id': "$fieldX"}}])

From the shell:

  • Replace find_one with findOne
  • The same remove command should work (see the sketch below).
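For reference, a rough mongo shell equivalent of one iteration of the loop above (the field names field1/field2 and the values 'x'/'y' are placeholders):

 // keep one document per duplicate group and remove the rest
 var retain = db.collection.findOne({field1: 'x', field2: 'y'}, {_id: 1});
 db.collection.remove({field1: 'x', field2: 'y', _id: {$ne: retain._id}});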
0
Nov 30 '17 at 1:49

The following method combines documents with the same name, preserving only unique nodes without duplicating them.

I found that using the $out operator is an easy way. I unwind the nodes array and then group by name, adding the nodes to a set. The $out operator lets you save the aggregation result [docs]. If you give it the name of the collection itself, it will replace that collection with the new data; if the name does not exist, it will create a new collection.

Hope this helps.

allowDiskUse may need to be added to the pipeline.

 db.collectionName.aggregate([
   { $unwind: { path: "$nodes" } },
   { $group: {
       _id: "$name",
       nodes: { $addToSet: "$nodes" }
   }},
   { $project: {
       _id: 0,
       name: "$_id",
       nodes: 1
   }},
   { $out: "collectionNameWithoutDuplicates" }
 ], { allowDiskUse: true })   // allowDiskUse, as noted above, may be needed for large collections
0
Jan 21 '19 at 5:59

When using pymongo this should work.

Add the fields that must be unique for the collection to unique_field:

 unique_field = {"field1": "$field1", "field2": "$field2"}

 cursor = DB.COL.aggregate([
     {"$group": {"_id": unique_field,
                 "dups": {"$push": "$uuid"},
                 "count": {"$sum": 1}}},
     {"$match": {"count": {"$gt": 1}}},
     {"$group": {"_id": None,
                 "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}}
 ], allowDiskUse=True)

Pick entries out of the dups array depending on how many duplicates you have (here each group had exactly one extra duplicate, hence $arrayElemAt with index 1):

 items = list(cursor)
 removeIds = items[0]['dups']
 DB.COL.remove({"uuid": {"$in": removeIds}})   # remove from the same DB.COL collection as above
0
Sep 04 '19 at 13:47

Here is a more “manual” way to do it:

First, get a list of all the unique key values you are interested in.

Then search for each of these values and delete all documents beyond the first one that the search returns.

 db.collection.distinct("key").forEach((num) => {
   var i = 0;
   db.collection.find({ key: num }).forEach((doc) => {
     if (i) db.collection.remove({ key: num }, { justOne: true });  // keep the first match, remove the rest one by one
     i++;
   });
 });
-1
Aug 23 '17 at 12:42


