Why does Lucene use arrays instead of hash tables for its inverted index?

I watched Adrien Grand's talk about the Lucene index architecture, and he says that Lucene uses sorted arrays to represent the vocabulary of its inverted indexes. What is the reason for using sorted arrays instead of hash tables (the "classic" structure for inverted indexes)?

Hash tables provide O(1) insertion and lookup, which I imagine is very useful for quickly processing queries and merging index segments. Sorted arrays, on the other hand, can only offer O(log N) lookup and (gasp) O(N) insertion, although merging two sorted arrays is about as cheap as merging two hash tables.
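To make the merge point concrete, here is a minimal sketch (not Lucene's actual code) of merging two sorted arrays: a single linear pass with two cursors, O(n + m), and the output stays sorted — which is exactly what segment merging needs.

```java
import java.util.Arrays;

public class SortedMerge {
    // Merge two sorted arrays in one linear pass; the result is sorted too.
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    public static void main(String[] args) {
        int[] merged = merge(new int[]{1, 4, 7}, new int[]{2, 4, 9});
        System.out.println(Arrays.toString(merged)); // [1, 2, 4, 4, 7, 9]
    }
}
```

Merging two hash tables, by contrast, produces an unordered result and touches memory in a random-access pattern, so "as cheap" holds only for the asymptotic count of operations, not for the actual work done.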

The only disadvantages of hash tables that I can think of are higher memory usage (this can actually be a problem) and poorer cache locality (although operations such as querying a sorted array require a binary search, which is not cache-friendly either).

So what gives? The Lucene developers must have had a very good reason for using arrays. Is it related to scalability? Disk read speed? Something else entirely?

1 answer

Well, I will speculate here (this should probably be a comment, but it would be too long).

  • A HashMap is generally a fast lookup structure with O(1) search time - but that is the average case. Since Java 8, a HashMap converts a heavily collided bucket into TreeNodes, making lookups within that bucket O(log n). And even if we assume lookup complexity is O(1), that does not mean it takes the same time as another structure's lookup; it only means the cost is constant for each individual data structure.
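For comparison, the O(log N) lookup on the sorted-array side is just a binary search over the vocabulary. A small sketch (the vocabulary here is made up for illustration):

```java
import java.util.Arrays;

public class VocabLookup {
    public static void main(String[] args) {
        // A sorted vocabulary, as the terms dictionary is conceptually laid out.
        String[] vocab = {"apple", "banana", "cherry", "date", "fig"};

        // Hit: O(log n) comparisons, no hashing, no boxing.
        System.out.println(Arrays.binarySearch(vocab, "cherry")); // 2

        // Miss: returns (-(insertionPoint) - 1), which is directly useful
        // for prefix and range queries - a hash table cannot offer this.
        System.out.println(Arrays.binarySearch(vocab, "coconut")); // -4
    }
}
```

The negative return value on a miss is the detail worth noticing: a failed hash lookup tells you nothing, while a failed binary search tells you exactly where the term would be.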

  • Memory really matters - here is an example. In short, storing 15_000_000 entries in a HashMap requires a little over 1 GB of RAM; sorted arrays are probably much more compact, especially since they can hold primitives rather than objects.
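A back-of-the-envelope estimate of where that gap comes from. The per-entry byte counts below are rough, JVM-dependent assumptions (64-bit HotSpot with compressed oops), not measured values:

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long n = 15_000_000L;

        // Assumed costs per HashMap<Integer, Integer> entry:
        // ~32 bytes for the HashMap.Node, ~16 bytes for each boxed Integer,
        // plus ~5 bytes of table reference slots at the default load factor.
        long perEntryHashMap = 32 + 16 + 16 + 5;

        // A primitive int[] costs 4 bytes per element plus a tiny header.
        long perEntryArray = 4;

        System.out.println(n * perEntryHashMap / (1024 * 1024) + " MB"); // roughly 1 GB
        System.out.println(n * perEntryArray / (1024 * 1024) + " MB");   // tens of MB
    }
}
```

Even if the exact constants are off, the shape of the result holds: the boxed-object design costs an order of magnitude more than a primitive array for the same logical data.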

  • Inserting entries into a HashMap (eventually) requires all keys to be re-hashed when the table resizes, which can be a significant performance hit, since they all potentially have to be moved to different buckets.
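Why resizing moves entries: the bucket index is derived from the hash and the current table size via `hash & (capacity - 1)`, so doubling the capacity exposes one more hash bit and can send an entry to a different bucket. A sketch with a hash value chosen purely for illustration:

```java
public class RehashSketch {
    public static void main(String[] args) {
        int hash = 21; // binary 10101 - a hypothetical key hash

        // Bucket in a 16-slot table: only the low 4 bits matter.
        System.out.println(hash & 15); // 5

        // After doubling to 32 slots, bit 4 now participates,
        // so this entry must move to a different bucket.
        System.out.println(hash & 31); // 21
    }
}
```

A sorted array has no analogous global reshuffle; appending and re-sorting (or merge-building a new segment) touches memory strictly sequentially.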

  • There is probably one more point here - range searches, which would require something like a TreeMap; sorted arrays are probably much better suited for that. I am also thinking about partitioning the index (maybe they do this internally).
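A range scan over a sorted array is two binary searches plus a contiguous slice - no TreeMap needed. A minimal sketch (the helper and vocabulary are illustrative, not Lucene APIs):

```java
import java.util.Arrays;

public class RangeQuery {
    // Return all terms in [from, to) from a sorted vocabulary -
    // the kind of range scan a TreeMap offers, done on a plain array.
    static String[] range(String[] vocab, String from, String to) {
        int lo = Arrays.binarySearch(vocab, from);
        if (lo < 0) lo = -lo - 1; // insertion point on a miss
        int hi = Arrays.binarySearch(vocab, to);
        if (hi < 0) hi = -hi - 1;
        return Arrays.copyOfRange(vocab, lo, hi);
    }

    public static void main(String[] args) {
        String[] vocab = {"apple", "banana", "cherry", "date", "fig"};
        System.out.println(Arrays.toString(range(vocab, "b", "d")));
        // [banana, cherry]
    }
}
```

The matching terms come back as one contiguous slice, which is also ideal for sequential disk reads - something neither a HashMap nor a TreeMap can provide.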

  • I had the same thought as you - arrays are usually contiguous in memory, which makes them much easier for the CPU to prefetch.

  • And one last thing: put in their place, I would also start with a HashMap... I am sure they had good reasons to move away from it. I wonder whether they have actual benchmarks that justify this choice.


Source: https://habr.com/ru/post/1270041/
