NoSQL: Indexing and Keyword Search

I have an application that stores items (such as web documents). Each item can have an arbitrarily large set of tags, and a typical request is to get all documents that have a given set of tags. In other words, a fairly common web application.

I am now considering a NoSQL database as the persistent store. Various NoSQL systems (MongoDB, for example) support secondary indexes and therefore keyword search. Examples showing how to do this on different systems are easy to find. The problem is that I would like to know what happens "under the hood": how and where the secondary indexes are stored, and how a query with a list of tags is actually executed, especially in systems with many nodes.

I am aware of solutions based on MapReduce or similar, but here I am interested in how the indexing works. Some of the questions I have:

  • Does the secondary index hold only the identifier of an item, or the item itself?
  • If the query contains k tags, are k subqueries executed, one for each tag, and the k partial results combined at the initiating node?

Where can I find this information for different NoSQL systems? Thanks so much for any tips.

Christian

1 answer

In MongoDB, an index on tags would be implemented as a multikey index, in which the database matches documents against each element of the array. You would index the tags attribute of the document, which creates a B-tree built from the tag values in that array.
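To make the "one entry per array element" idea concrete, here is a minimal sketch in Python. It is not MongoDB code: the `MultikeyIndex` class and its methods are made up for illustration, and a sorted list with binary search stands in for the B-tree.

```python
import bisect


class MultikeyIndex:
    """Toy stand-in for a B-tree: a sorted list of (tag, doc_id) entries."""

    def __init__(self):
        self.entries = []  # always kept sorted by (tag, doc_id)

    def add(self, doc_id, tags):
        # One index entry per array element; this is what makes the index "multikey".
        for tag in set(tags):
            bisect.insort(self.entries, (tag, doc_id))

    def find(self, tag):
        # Binary search to the first entry for this tag, then scan the contiguous run.
        i = bisect.bisect_left(self.entries, (tag,))
        doc_ids = []
        while i < len(self.entries) and self.entries[i][0] == tag:
            doc_ids.append(self.entries[i][1])
            i += 1
        return doc_ids


index = MultikeyIndex()
index.add(1, ["nosql", "mongodb"])
index.add(2, ["nosql", "indexing"])
print(index.find("nosql"))
```

A document with three tags therefore contributes three entries to the index, and a lookup for one tag costs a binary search plus a scan over the matching entries.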

You can learn more about multikeys here, and get more information on indexing in MongoDB from this presentation: MongoDB Internals

Does the secondary index hold only the identifier of an item, or the item itself?

An index entry consists of the indexed field value (in your case the field is an array of tags, so each entry holds a single tag) and an offset used to locate the document efficiently in memory. Each entry also carries some extra overhead, as described here.
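As a rough illustration of "key plus location", the sketch below keeps documents in a plain list and lets the index map each tag to list positions. The names `records`, `tag_index`, `insert` and `find_by_tag` are hypothetical, and a list position only approximates a real record offset.

```python
from collections import defaultdict

records = []                    # stand-in for the data file; list position = record location
tag_index = defaultdict(list)   # tag -> locations of the records carrying that tag


def insert(doc):
    """Store the document and add one (tag -> location) entry per tag."""
    loc = len(records)
    records.append(doc)
    for tag in set(doc["tags"]):
        tag_index[tag].append(loc)
    return loc


def find_by_tag(tag):
    # Step 1: the index yields locations only. Step 2: each location is dereferenced.
    return [records[loc] for loc in tag_index.get(tag, [])]


insert({"title": "Intro to indexes", "tags": ["nosql", "indexing"]})
insert({"title": "Sharding basics", "tags": ["nosql"]})
print([doc["title"] for doc in find_by_tag("nosql")])
```

So in this model the index never contains the document itself: answering a query always takes the extra step of fetching the document from wherever the entry points.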

If the query contains k tags, are k subqueries executed, one for each tag, and the k partial results combined at the initiating node?

It depends, but if, for example, the query used $or or $in on the tags field, I think the subqueries are executed in parallel, each in O(log n), and the partial results are then merged to form the final result. I'm not sure about that, though.
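For the multi-node case, one common general design is a local index per node with scatter-gather at the node that received the query. The sketch below shows that design only; I am not claiming that MongoDB or any other specific system works exactly this way, and `Shard`, `find_all`, `find_any` and `query_all` are invented names.

```python
class Shard:
    """One node holding a subset of the documents plus a local tag index."""

    def __init__(self):
        self.index = {}  # tag -> set of ids of documents stored on this shard

    def add(self, doc_id, tags):
        for tag in tags:
            self.index.setdefault(tag, set()).add(doc_id)

    def find_all(self, tags):
        # Documents carrying every tag: intersect the per-tag sets, smallest first.
        postings = sorted((self.index.get(t, set()) for t in tags), key=len)
        if not postings:
            return set()
        result = set(postings[0])
        for posting in postings[1:]:
            result &= posting
        return result

    def find_any(self, tags):
        # Documents carrying at least one tag: union of the per-tag sets.
        result = set()
        for t in tags:
            result |= self.index.get(t, set())
        return result


def query_all(shards, tags):
    # Scatter: every shard answers from its local index. Gather: union the partial results.
    result = set()
    for shard in shards:
        result |= shard.find_all(tags)
    return result


s1, s2 = Shard(), Shard()
s1.add(1, ["nosql", "mongodb"])
s1.add(2, ["nosql"])
s2.add(3, ["nosql", "mongodb", "indexing"])
print(sorted(query_all([s1, s2], ["nosql", "mongodb"])))
```

In this design the k per-tag lookups are combined on each node, so the initiating node only has to union one partial result per node. The alternative, an index partitioned by tag, would instead ship per-tag lists to the initiating node and intersect them there.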


Source: https://habr.com/ru/post/1388154/

