It is usually best to define the complete aggregation pipeline separately from the method call that runs it, and to follow the same structure and indentation as the JSON notation you will find used here.
That makes it much easier to see where you deviate from the structure:
    import java.util.Arrays;
    import java.util.List;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBObject;

    List<DBObject> pipeline = Arrays.<DBObject>asList(
        // Keep only documents in the wanted category
        new BasicDBObject("$match", new BasicDBObject("categoryID", 4)),
        // First $group: one result per distinct product/article/colour/size combination
        new BasicDBObject("$group",
            new BasicDBObject("_id",
                new BasicDBObject("productID", "$productID")
                    .append("articleID", "$articleID")
                    .append("colour", "$colour")
                    .append("size",
                        new BasicDBObject("sku", "$skuID")
                            .append("size", "$size")
                    )
            )
        ),
        // Second $group: collect the sizes per product/article/colour
        new BasicDBObject("$group",
            new BasicDBObject("_id",
                new BasicDBObject("productID", "$_id.productID")
                    .append("articleID", "$_id.articleID")
                    .append("colour", "$_id.colour")
            )
                .append("size", new BasicDBObject("$push", "$_id.size"))
        ),
        // Reshape the output documents
        new BasicDBObject("$project",
            new BasicDBObject("_id", 0)
                .append("productID", "$_id.productID")
                .append("colour", "$_id.colour")
                .append("size", 1)
        )
    );
Also note some simplified naming here and the use of $push rather than $addToSet. The latter is because the first $group stage has already made the values distinct by including them in its _id, so $addToSet would do nothing useful here. It would in fact discard any inherent order the results carry over from the previous stage, or any order you imposed deliberately.
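To make the reasoning concrete, here is a rough in-memory analogy in plain Java. It does not talk to MongoDB at all, and the class and method names are made up for illustration. It only shows that once the pairs are already distinct, simply appending each one to a list cannot produce duplicates and keeps the pairs in the order they arrive:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory analogy; not part of the MongoDB driver.
final class PushAfterDistinct {

    // Analogue of the first $group: keep each (product, size) pair only once.
    static Set<List<String>> distinctPairs(List<List<String>> rows) {
        return new LinkedHashSet<>(rows);
    }

    // Analogue of the second $group with $push: append every size to its product's list.
    static Map<String, List<String>> pushSizes(Set<List<String>> pairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (List<String> pair : pairs) {
            grouped.computeIfAbsent(pair.get(0), k -> new ArrayList<>()).add(pair.get(1));
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("p1", "S"),
            Arrays.asList("p1", "M"),
            Arrays.asList("p1", "S"),   // duplicate pair, dropped by distinctPairs
            Arrays.asList("p2", "L"));
        System.out.println(pushSizes(distinctPairs(rows)));
    }
}
```

The point is only that de-duplicating a second time buys nothing; the ordering the server actually hands from one stage to the next is a separate matter and is not modelled here.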
With that in mind, you could of course reduce this to a single $group, since $addToSet performs its own "distinct" operation:
    List<DBObject> pipeline = Arrays.<DBObject>asList(
        new BasicDBObject("$match", new BasicDBObject("categoryID", 4)),
        // Single $group: $addToSet keeps each sku/size pair only once
        new BasicDBObject("$group",
            new BasicDBObject("_id",
                new BasicDBObject("productID", "$productID")
                    .append("articleID", "$articleID")
                    .append("colour", "$colour")
            )
                .append("size",
                    new BasicDBObject("$addToSet",
                        new BasicDBObject("sku", "$skuID")
                            .append("size", "$size")
                    )
                )
        )
    );
I would also recommend dropping the final $project, since it has to pass over every result and alter each document. That is just extra processing, which is usually better handled in the client.
In general, the fewer aggregation stages the better. If a stage is not doing anything significant, another layer of the application will probably handle that work better than the database server.
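As a minimal sketch of what handling the reshaping in the client could look like, the following uses plain java.util maps as a stand-in for the grouped result documents. The class and the flatten helper are hypothetical names of my own, and the sample values are invented. In the legacy driver BasicDBObject is itself a map of field names to values, so something along these lines should carry over, but treat that as an assumption to verify against your driver version:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; plain Maps stand in for the documents returned by the driver.
final class ClientSideReshape {

    // Lifts productID and colour out of the compound _id and keeps "size" as is,
    // which is the work the $project stage was doing on the server.
    @SuppressWarnings("unchecked")
    static Map<String, Object> flatten(Map<String, Object> grouped) {
        Map<String, Object> id = (Map<String, Object>) grouped.get("_id");
        Map<String, Object> flat = new LinkedHashMap<>();
        flat.put("productID", id.get("productID"));
        flat.put("colour", id.get("colour"));
        flat.put("size", grouped.get("size"));
        return flat;
    }

    public static void main(String[] args) {
        // Invented sample shaped like one result of the single-$group pipeline
        Map<String, Object> id = new LinkedHashMap<>();
        id.put("productID", 10);
        id.put("articleID", 7);
        id.put("colour", "red");

        Map<String, Object> size = new LinkedHashMap<>();
        size.put("sku", "A-1");
        size.put("size", "M");

        Map<String, Object> grouped = new LinkedHashMap<>();
        grouped.put("_id", id);
        grouped.put("size", Arrays.asList(size));

        System.out.println(flatten(grouped));
    }
}
```

This way the reshaping happens only for the documents the client actually reads, rather than as an extra pass over every result on the server.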