What data structure should I use to store hash values?

I have a hash table that I want to save to disk. The records look like this:

<16-byte key                   > <1-byte result>
a7b4903def8764941bac7485d97e4f76 04
b859de04f2f2ff76496879bda875aecf 03
etc...

There are 1-5 million records. Currently I just store them in a single file, 17 bytes per record times the number of records, so the file is tens of megabytes. My goal is to store them so as to minimize disk space first and search time second. Insertion time does not matter.

What is the best way to do this? I would like the file to be as small as possible. Multiple files would be fine too. A Patricia tree? A radix trie?

I will implement and test whatever good suggestions I receive, and post the results here for everyone to see.

+3
6 answers

You can simply sort the entries by key and perform a binary search.

Fixed-size keys and data mean that you can jump from record to record very quickly, and storing only the key and the data means that you waste no space on metadata.
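To illustrate why fixed-size records make random access cheap, here is a minimal sketch of reading record number `index` straight from the file. The names `read_record`, `KEY_LEN` and `REC_LEN` are hypothetical, and the layout (16-byte key followed by a 1-byte result, no header) is assumed from the question.

```c
#include <assert.h>
#include <stdio.h>

enum { KEY_LEN = 16, REC_LEN = 17 };   /* assumed layout: 16-byte key + 1-byte result */

/* Reads record number `index` from `f` into `rec` (REC_LEN bytes).
   Returns 1 on success, 0 on a seek failure or a short read. */
int read_record(FILE *f, long index, unsigned char *rec)
{
    assert(f != NULL && rec != NULL && index >= 0);
    /* Record i starts at byte offset i * REC_LEN; no index structure is needed. */
    if (fseek(f, index * REC_LEN, SEEK_SET) != 0)
        return 0;
    return fread(rec, 1, REC_LEN, f) == REC_LEN;
}
```

The result byte of a record read this way is `rec[KEY_LEN]`.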

I don't think you will do better on disk space, and the search time is O(log n). Insertion is very slow, but you said that doesn't matter.
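A minimal sketch of that lookup, assuming the sorted file has been loaded into memory as one byte buffer of 17-byte records. The names `find_record`, `KEY_LEN` and `REC_LEN` are hypothetical, and keys are assumed to be compared as raw bytes with `memcmp`, so the file must be sorted in the same byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { KEY_LEN = 16, REC_LEN = 17 };   /* assumed layout: 16-byte key + 1-byte result */

/* Returns the result byte stored for `key`, or -1 if the key is absent.
   `recs` holds `n` records of REC_LEN bytes each, sorted ascending by key. */
int find_record(const unsigned char *recs, size_t n, const unsigned char *key)
{
    assert(key != NULL && (recs != NULL || n == 0));
    size_t lo = 0, hi = n;                     /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const unsigned char *rec = recs + mid * REC_LEN;
        int c = memcmp(key, rec, KEY_LEN);
        if (c == 0)
            return rec[KEY_LEN];               /* the byte after the key is the result */
        if (c < 0)
            hi = mid;                          /* key sorts before record mid */
        else
            lo = mid + 1;                      /* key sorts after record mid */
    }
    return -1;
}
```

For 5 million records the loop runs at most 23 times. The same search works directly on disk by replacing the pointer arithmetic with an `fseek` to offset `mid * 17` and an `fread` of one record.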

+4


sqlite?

+1

- , - - ~ 100 - , 2 .

- - , ASC . , entryNumber * 17, , , -, ~ log2 (entriesNumber), . " " , , - . .. , log2 (entriesNumber) .

+1

, ( ) , . , 16- , , - - , :

  • , , ; , , , ;

  • , shebang ;

  • 4 16 ^ 4 (= 65536); 5x10 ^ 6 , 76 ; , , 100 ; :

  • 0 4 0x0000; pad 100 (1700 , ) 0s;

  • 1700 4 0x0001, pad,

  • , .

, , 100 , , . , 16 ^ 5 , 6 (6x16 ^ 5 = 6291456). , , , .

- , , , (a) (b), ( ).

, , , 4 , .

, , , - - .

+1

Your key is 128 bits, but if you have at most 10^7 records, it takes only 24 bits to index them.

  • You can create a hash table, or

  • Use an unrolled binary search in the style of Bentley (no more than 24 comparisons).

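The loop is unrolled here (with 32-bit ints).

unsigned int key[4];
unsigned int a[1<<24][4];  /* sorted ascending; unused tail assumed padded with the maximum key */
int i;

/* Lexicographic three-way comparison of two 4-word keys: <0, 0 or >0. */
static int compare4(const unsigned int *x, const unsigned int *y)
{
    for (int k = 0; k < 4; k++)
        if (x[k] != y[k]) return x[k] < y[k] ? -1 : 1;
    return 0;
}
#define COMPARE(key, i) compare4((key), a[(i)])

i = 0;
if (COMPARE(key, (i+(1<<23))) >= 0) i += (1<<23);
if (COMPARE(key, (i+(1<<22))) >= 0) i += (1<<22);
if (COMPARE(key, (i+(1<<21))) >= 0) i += (1<<21);
...
if (COMPARE(key, (i+(1<<3))) >= 0) i += (1<<3);
if (COMPARE(key, (i+(1<<2))) >= 0) i += (1<<2);
if (COMPARE(key, (i+(1<<1))) >= 0) i += (1<<1);
if (COMPARE(key, (i+(1<<0))) >= 0) i += (1<<0);
/* a[i] is now the last entry <= key; compare a[i] with key to confirm an exact match */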
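The unrolled sequence can also be written as a loop over decreasing powers of two. The sketch below is one way to do that, not the answerer's code: it assumes keys are stored as raw 16-byte strings compared with `memcmp`, and the names `step_search` and `KEY_LEN` are hypothetical. A bounds check stands in for padding the table up to a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { KEY_LEN = 16 };

/* Returns the index of the last key <= `key` in the sorted table `keys`
   (n keys of KEY_LEN bytes each), or 0 if no such key exists. The caller
   compares keys[result] with `key` to confirm an exact match. */
size_t step_search(const unsigned char *keys, size_t n, const unsigned char *key)
{
    assert(key != NULL && (keys != NULL || n == 0));
    size_t i = 0, step = 1;
    while (step * 2 <= n)                /* largest power of two <= n */
        step *= 2;
    for (; step > 0; step /= 2) {
        size_t j = i + step;
        /* the j < n test replaces the padding a fixed 2^24-entry table needs */
        if (j < n && memcmp(key, keys + j * KEY_LEN, KEY_LEN) >= 0)
            i = j;
    }
    return i;
}
```

With n up to 10^7 the loop body runs at most 24 times, matching the 24 comparisons of the unrolled version.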
+1

Source: https://habr.com/ru/post/1726521/

