How to define a primary key field in a Lucene document to get the best search performance?

When creating a document in my Lucene index (v7.2), I add a uid field to it that contains a unique identifier / key (string):

 doc.add(new StringField("uid", uid, Field.Store.YES));

To get this document later, I create a TermQuery for this unique identifier and search for it using IndexSearcher:

 searcher.search(new TermQuery(new Term("uid", uid)), 1);
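For reference, a minimal self-contained version of this setup looks like the following (the in-memory RAMDirectory, the class name, and the example uid value are made up just for illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class UidLookupDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory(); // in-memory index, illustration only
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // StringField indexes the value as a single, untokenized term
                doc.add(new StringField("uid", "user-42", Field.Store.YES));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("uid", "user-42")), 1);
                if (hits.totalHits > 0) {
                    System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("uid"));
                }
            }
        }
    }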

As a newbie to Lucene, I would like to know the following:

  • How can I improve this approach to get the best search performance? Would it, for example, matter if I store the unique identifier as an array of bytes instead of a string? Or are there some special codecs or filters that you can use?

  • What is the time complexity of finding a document by its unique identifier? Since the index contains at least one unique term per document, search time will grow linearly with the number of documents (O(n)), right?

1 answer

Theory

There is a blog post about Lucene's term index and search performance. It covers in detail what is involved in looking up a document by id. The post is fairly old, but the fundamentals have not changed since then.

Here are some points related to your question:

  • Lucene is a search engine in which the smallest searchable element is a term, which means that binary, numeric, and string fields are all represented as byte terms in the BlockTree term dictionary.
  • In general, lookup complexity depends on the length of the term: Lucene performs a term lookup with an in-memory prefix trie. Due to the constraints of real hardware and software implementations (to avoid excessive disk reads and memory overflow for extremely large tries), Lucene uses the BlockTree structure: it stores the prefix trie in small blocks on disk and loads only one block at a time. This is why it is so important to generate keys in a readable, prefix-friendly manner. So, ranked by degree of influence, the factors are (see the sketch after this list):
    • term length - longer terms mean more blocks to load
    • term pattern - predictable prefixes avoid unnecessary block reads
    • term count - fewer terms mean fewer blocks
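To see what the TermQuery lookup does under the hood, you can seek the term dictionary directly. A sketch, assuming Lucene 7.x (MultiFields moved to MultiTerms in later versions); the findDocByUid helper name is made up, and reader is an open IndexReader:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    static int findDocByUid(IndexReader reader, String uid) throws IOException {
        Terms terms = MultiFields.getTerms(reader, "uid"); // the field's term dictionary
        if (terms == null) return -1;
        TermsEnum te = terms.iterator();
        // seekExact walks the on-disk BlockTree blocks guided by the in-memory
        // prefix index, so its cost grows with term length, not document count
        if (!te.seekExact(new BytesRef(uid))) return -1;
        PostingsEnum postings = te.postings(null, PostingsEnum.NONE);
        return postings.nextDoc(); // doc id of the (unique) matching document
    }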

Algorithms and Complexity

Let a term be a single string and let the term dictionary be a large set of terms. If we have a term dictionary and need to know whether a single term occurs in it, a trie (and the minimal deterministic acyclic finite state automaton (DAFSA) as a subclass) is a data structure that can help us. To the question "why use tries if a hash lookup can do the same?", here are a few reasons:

  • Tries can look up a string in O(L) time (where L is the length of the term). This is slightly faster than a hash table in the worst case (a hash table needs a linear scan on hash collisions, plus the cost of a complex hash algorithm such as MurmurHash3), and comparable to a hash table in the ideal case.
  • Hash tables can only find terms that exactly match the single term we are looking for, whereas a trie also lets us find terms that differ in one character, share a common prefix, have a missing character, and so on.
  • A trie keeps its entries ordered by key, so we can list all terms in alphabetical order (see the sketch after this list).
  • A trie (and especially a DAFSA) provides a very compact representation of terms with deduplication.
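The ordering property is easy to observe on a real index. A sketch under the same Lucene 7.x assumptions as above (the listUids helper name is made up):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    // Terms come back in sorted (byte-wise) order: the trie/FST ordering property.
    static void listUids(IndexReader reader) throws IOException {
        Terms terms = MultiFields.getTerms(reader, "uid");
        if (terms == null) return;
        TermsEnum te = terms.iterator();
        for (BytesRef term = te.next(); term != null; term = te.next()) {
            System.out.println(term.utf8ToString());
        }
    }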

Here is an example of a DAFSA for the three terms bath, bat, and batch:

(figure: DAFSA data structure example)

When searching for a key, note that descending one level in the automaton (or trie) takes constant time, and every time the algorithm descends one level, one character is cut off from the term; so we can conclude that finding a term in the automaton (trie) can be done in O(L) time.
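To make the O(L) bound concrete, here is a toy trie; this is an illustration only, not Lucene's actual implementation (Lucene's term index is an FST/BlockTree). Lookup consumes one character per level, so it performs at most L constant-time steps:

    import java.util.HashMap;
    import java.util.Map;

    class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        boolean isTerm; // true if a term ends at this node
    }

    class Trie {
        private final TrieNode root = new TrieNode();

        void insert(String term) {
            TrieNode node = root;
            for (char c : term.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new TrieNode());
            }
            node.isTerm = true;
        }

        boolean contains(String term) {
            TrieNode node = root;
            for (char c : term.toCharArray()) { // one constant-time step per character
                node = node.children.get(c);
                if (node == null) return false;
            }
            return node.isTerm;
        }
    }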


Source: https://habr.com/ru/post/1270044/

