Find all duplicate documents in MongoDB collection by key field

Suppose I have a collection with a set of documents, something like this:

 { "_id" : ObjectId("4f127fa55e7242718200002d"), "id" : 1, "name" : "foo" }
 { "_id" : ObjectId("4f127fa55e7242718200002e"), "id" : 2, "name" : "bar" }
 { "_id" : ObjectId("4f127fa55e7242718200002f"), "id" : 3, "name" : "baz" }
 { "_id" : ObjectId("4f127fa55e72427182000030"), "id" : 4, "name" : "foo" }
 { "_id" : ObjectId("4f127fa55e72427182000031"), "id" : 5, "name" : "bar" }
 { "_id" : ObjectId("4f127fa55e72427182000032"), "id" : 6, "name" : "bar" }

I want to find all duplicate entries in this collection by the "name" field. For example, "foo" appears twice and "bar" appears three times.

+45

duplicates mongodb mapreduce aggregation-framework

4 answers

Note: this solution is the easiest to understand, but not the best.

You can use mapReduce to count how many times each value of a specific field occurs across documents:

 var map = function() {
     if (this.name) {
         emit(this.name, 1);
     }
 };

 var reduce = function(key, values) {
     return Array.sum(values);
 };

 // Write the counts to an output collection so we can query it afterwards.
 var res = db.collection.mapReduce(map, reduce, { out : "name_counts" });
 db[res.result].find({ value : { $gt : 1 } }).sort({ value : -1 });
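To see what the map and reduce phases are doing, here is a minimal stand-in in plain Python over an in-memory list (the sample documents are hypothetical; no MongoDB is needed):

```python
from collections import defaultdict

# In-memory stand-in for the collection above (hypothetical data).
docs = [
    {"id": 1, "name": "foo"}, {"id": 2, "name": "bar"},
    {"id": 3, "name": "baz"}, {"id": 4, "name": "foo"},
    {"id": 5, "name": "bar"}, {"id": 6, "name": "bar"},
]

# "map" phase: emit (name, 1) for every document that has a name.
emitted = defaultdict(list)
for doc in docs:
    if doc.get("name"):
        emitted[doc["name"]].append(1)

# "reduce" phase: sum the emitted values per key.
counts = {name: sum(values) for name, values in emitted.items()}

# Keep only names seen more than once, most frequent first.
duplicates = sorted(
    ((name, c) for name, c in counts.items() if c > 1),
    key=lambda pair: -pair[1],
)
print(duplicates)  # [('bar', 3), ('foo', 2)]
```

The final filter and sort correspond to the `find({value: {$gt: 1}}).sort({value: -1})` step above.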
+15

The accepted answer is terribly slow on large collections, and it does not return the `_id`s of the duplicate entries.

Aggregation is much faster and can return the `_id`s:

 db.collection.aggregate([
     { $group : {
         _id : { name : "$name" },  // replace `name` here (and below) with your key field
         uniqueIds : { $addToSet : "$_id" },
         count : { $sum : 1 }
     } },
     { $match : { count : { $gte : 2 } } },
     { $sort : { count : -1 } },
     { $limit : 10 }
 ]);

In the first stage of the pipeline, the $group operator groups the documents by the name field and, via $addToSet, stores the _id of each grouped record in uniqueIds. The $sum operator adds the values passed to it, in this case the constant 1, thereby counting the number of grouped records in the count field.

In the second stage of the pipeline, we use $match to keep only documents with a count of at least 2, i.e. the duplicates.

Then we sort so the most common duplicates come first and limit the results to the top 10.

This query will output up to $limit entries with duplicate names, along with their _ids. For example:

 {
     "_id" : { "name" : "Toothpick" },
     "uniqueIds" : [
         "xzuzJd2qatfJCSvkN",
         "9bpewBsKbrGBQexv4",
         "fi3Gscg9M64BQdArv"
     ],
     "count" : 3
 },
 {
     "_id" : { "name" : "Broom" },
     "uniqueIds" : [
         "3vwny3YEj2qBsmmhA",
         "gJeWGcuX6Wk69oFYD"
     ],
     "count" : 2
 }
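The pipeline's logic ($group with $addToSet and $sum, then $match, $sort, and $limit) can be mimicked over an in-memory list in plain Python; the documents and string `_id`s below are hypothetical stand-ins for ObjectIds:

```python
from collections import defaultdict

# Hypothetical in-memory documents; the _id strings stand in for ObjectIds.
docs = [
    {"_id": "a1", "name": "foo"}, {"_id": "a2", "name": "bar"},
    {"_id": "a3", "name": "baz"}, {"_id": "a4", "name": "foo"},
    {"_id": "a5", "name": "bar"}, {"_id": "a6", "name": "bar"},
]

# $group: collect the set of _ids and a running count per name.
groups = defaultdict(lambda: {"uniqueIds": set(), "count": 0})
for doc in docs:
    g = groups[doc["name"]]
    g["uniqueIds"].add(doc["_id"])
    g["count"] += 1

# $match (count >= 2), then $sort (count descending), then $limit 10.
result = sorted(
    (
        {"_id": {"name": n}, "uniqueIds": sorted(g["uniqueIds"]), "count": g["count"]}
        for n, g in groups.items()
        if g["count"] >= 2
    ),
    key=lambda r: -r["count"],
)[:10]
print(result)
```

Each dict in `result` has the same shape as the documents the real pipeline emits.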
+129
Aug 12 '13 at 2:00

For a general Mongo solution, see the MongoDB Cookbook recipe for finding duplicates using group. Note that aggregation is faster and more powerful, since it can return the _ids of the duplicate records.

For the PHP driver, the accepted answer (using mapReduce) is not as efficient. Instead, we can use the group method:

 $connection = 'mongodb://localhost:27017';
 $con = new Mongo($connection); // MongoDB connection
 $db = $con->test;              // database
 $collection = $db->prb;        // collection

 // Select the name field and group by it
 $keys = array("name" => 1);

 // Set initial values
 $initial = array("count" => 0);

 // JavaScript function to perform the reduction
 $reduce = "function (obj, prev) { prev.count++; }";

 $g = $collection->group($keys, $initial, $reduce);

 echo "<pre>";
 print_r($g);

The output will be as follows:

 Array
 (
     [retval] => Array
         (
             [0] => Array
                 (
                     [name] =>
                     [count] => 1
                 )
             [1] => Array
                 (
                     [name] => MongoDB
                     [count] => 2
                 )
         )
     [count] => 3
     [keys] => 2
     [ok] => 1
 )

An equivalent SQL query would look like this: SELECT name, COUNT(name) FROM prb GROUP BY name . Note that we still need to filter out entries with a count of 1 from the returned array to keep only duplicates. Again, refer to the MongoDB Cookbook recipe for finding duplicates using group for a canonical solution using group.
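The SQL equivalence can be checked with a throwaway SQLite table standing in for the `prb` collection; the HAVING clause (not in the quoted query) does the duplicate filtering that the text says must otherwise happen in application code:

```python
import sqlite3

# A throwaway in-memory SQLite table standing in for the `prb` collection.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prb (id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO prb VALUES (?, ?)",
    [(1, "foo"), (2, "bar"), (3, "baz"),
     (4, "foo"), (5, "bar"), (6, "bar")],
)

# The SQL equivalent of the group; HAVING keeps only the duplicates.
rows = con.execute(
    "SELECT name, COUNT(name) AS cnt FROM prb "
    "GROUP BY name HAVING cnt > 1 ORDER BY cnt DESC"
).fetchall()
print(rows)  # [('bar', 3), ('foo', 2)]
```

Dropping the HAVING clause reproduces the unfiltered per-name counts that the group call returns.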

+5
Feb 11 '13 at 5:16