I watched Adrien Grand talked about the Lucene index architecture , and he does what Lucene uses sorted arrays to represent the vocabulary of his inverted indexes. What is the reason for using sorted arrays instead of hash tables (the "classic" structure of inverted indexes)?
Hash tables provide O (1) input and access capabilities, which I find very useful for quickly processing queries and merging index segments. Sorted arrays, on the other hand, can only offer O (logN) and (gasp) O (N) access, although merging 2 sorted arrays is as complex as joining two hash tables.
The only disadvantages of hash tables that I can imagine are more memory (this can actually be a problem) and less convenience regarding cache memory (although operations such as querying a sorted array require a binary search that doesn't look like to cache).
So what? Lucene developers must have had a very good reason for using arrays. Is it related to scalability? Disk read speed? Is something else completely?
source share