Theory
There is a blog post about Lucene index term and search performance. It clearly shows all the details of the difficulty of finding a document by id. This post is pretty old, but nothing has changed since then.
Here are some points related to your question:
- Lucene is a search engine in which the minimum search element is a text term, so this means: binary, numeric and string fields are represented as strings in the Glossary in the block version .
- In general, the complexity of the search depends on the length of the word: Lucene uses the prefix-trie index structure in memory to perform a term search. Due to the limitations of real hardware and software implementations (to avoid excessive disk reads and memory overflows for extremely large attempts), Lucene uses the BlockTree structure. This means that it stores the prefix-trie in small chunks on disk and loads only one chunk per time. This is why it is so important to create keys in a readable manner. Therefore, let's arrange the factors depending on the degree of their influence:
- term length - more pieces to load
- term template - to avoid unnecessary readings
- count count - reduce the number of pieces
Algorithms and Complexity
Let the term be the only line and let the term dictionary be a large set of terms. If we have a glossary of terms and we need to know if there is one term in the glossary, three (and the minimum determinate acyclic finite state machine (DAFSA) as a subclass) is a data structure that can help us. To your question: “Why use attempts if a hash search can do the same?”, Here are a few reasons:
- Attempts can find strings in O (L) time (where L denotes the length of one member). This is slightly faster than a hash table in the worst case (a hash table requires linear scanning in case of hash collisions and a complex hash algorithm such as MurmurHash3), or similar to a hash table in the ideal case.
- Hash tables can only find dictionary words that exactly match the single term we are looking for; whereas trie allows us to find terms that have one different character, a common prefix, a missing character, etc.
- Trie can provide alphabetical ordering of entries with a key, so we can list all terms in alphabetical order.
- Trie (and especially DAFSA) provides a very compact representation of terms with deduplication.
Here is an example of DAFSA for three terms: bath, bat and batch: 
When searching for a key, note that reducing one level in automata (or trie) is performed in constant time, and each time the algorithm lowers one level in automata (trie), one character is cut out from the term, so we can conclude what to find a term in automata (trie) can be made in O (L) time.
source share