We are building a simplified version of the search engine on top of MongoDB.
Data set example
{ "_id" : 1, "dept" : "tech", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" } { "_id" : 2, "dept" : "tech", "updDate": ISODate("2014-07-27T09:45:35Z"), "description" : "wireless red mouse" } { "_id" : 3, "dept" : "kitchen", "updDate": ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat" } { "_id" : 4, "dept" : "kitchen", "updDate": ISODate("2014-05-27T09:45:35Z"), "description" : "red peeler" } { "_id" : 5, "dept" : "food", "updDate": ISODate("2014-04-27T09:45:35Z"), "description" : "green apple" } { "_id" : 6, "dept" : "food", "updDate": ISODate("2014-01-27T09:45:35Z"), "description" : "red potato" } { "_id" : 7, "dept" : "food", "updDate": ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer" } { "_id" : 8, "dept" : "food", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" } { "_id" : 9, "dept" : "food", "updDate": ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer" }
We want to avoid using the "offset-limit" to split the results, to do this, we mainly use the "search method" by modifying the "where / match" sentence of the query to be able to use the index instead of repeating through the collection to get the desired results. For more information on the βsearch method,β I highly recommend reading http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way
Search engines usually order results by date and date of update in descendant order. To do this, we use the text search function in the aggregation pipeline as follows.
db.inventory.createIndex({description:"text", dept: -1, updDate: -1, id:-1})
First page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, {$limit: 2 }] ) { "_id" : 5, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green apple", "score" : 0.75 } { "_id" : 3, "updDate" : ISODate("2014-04-27T09:45:35Z"), "description" : "green placemat", "score" : 0.75 }
Second page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]},"$text" : { "$language" : "en", "$search" : "green"} } },{ $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.75}} , { "$and" : [ { "score" : { "$eq" : 0.75}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-04-27T09:45:35Z")}},{ "$and" : [ { "updDate": { "$eq" : ISODate("2014-04-27T09:45:35Z")}} , { "_id" : { "$lt" : 3}}]}]}]}]}},{$limit: 2 }] ) { "_id" : 7, "updDate" : ISODate("2014-08-28T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 } { "_id" : 9, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
And the last page
db.inventory.aggregate( [ { $match: { dept : {$in : ["food","kitchen"]} , "$text" : { "$language" : "en", "$search" : "green"} } }, { $project: {score: { $meta: "textScore" }, description : 1, updDate : 1, _id: 1 } }, { $sort: { "score" : -1, "updDate" : -1, _id: -1 } }, { "$match" : { "$or" : [ { "score" : { "$lt" : 0.6666666666666666}} , { "$and" : [ { "score" : { "$eq" : 0.6666666666666666}} , { "$or" : [ { "updDate" : { "$lt" : ISODate("2014-08-27T09:45:35Z")}} , { "$and" : [ { "updDate" : { "$eq" : ISODate("2014-08-27T09:45:35Z")}} , { "_id" : { "$lt" : 9}}]}]}]}]}}, {$limit: 2 }] ) { "_id" : 8, "updDate" : ISODate("2014-08-27T09:45:35Z"), "description" : "lime green computer", "score" : 0.6666666666666666 }
Notice how we order results using score, updDate and id, and in the second phase of the match we try to break them into pages using the value of the documentβs rating, date of update, and finally, the identifier.
Creating an index takes into account that text queries cannot be covered in text index prefixes, see the question https://jira.mongodb.org/browse/SERVER-13018 , although I'm not sure if this applies to our case.
Since the "executeStats" and "allPlansExecution" modes do not work as part of the aggregation, see https://jira.mongodb.org/browse/SERVER-19758 . I do not know how MongoDB is trying to resolve the request.
Since index intersection does not work for text search, see https://jira.mongodb.org/browse/SERVER-3071 (allowed in 2.5.5) and http://blog.mongodb.org/post/87790974798/efficient- indexing-in-mongodb-26 , where the author says that
As of version 2.6.0, you cannot intersect with geo or text indices and you can intersect at most 2 separate indices with each other. These limitations are likely to change in a future release.
After reading sections 3.4 (Text Search Tutorials) and 3.5 (Indexing Strategies) https://docs.mongodb.org/manual/MongoDB-indexes-guide-master.pdf several times, without receiving a clear conclusion.
So what is the best indexing strategy for indexing this collection in terms of text search?
One index for the first phase of the match, and the second for the second phase of the match (pagination)?
db.inventory.createIndex({description:"text", dept: -1}) db.inventory.createIndex({updDate: -1, id:-})
Compound index taking into account fields from both phases of correspondence?
db.inventory.createIndex({description:"text", dept: -1, updDate: -1, id:-1})
None of the above?
thanks