BerkeleyDB - B-Tree vs Hash Table

I am trying to figure out what should drive the choice of access method when using BerkeleyDB: B-tree versus hash table. A hash table provides O(1) lookup, and inserts are cheap too (with linear/extensible hashing we get amortized O(1) insert). B-trees provide O(log_B N) search and insert time. B-trees can also support range queries and allow iteration in sorted order.

  • Beyond these considerations, what else should be taken into account?
  • If I do not need to support range queries, can I use the hash table access method?
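To make the range-query distinction concrete, here is a minimal sketch in plain Python (not BerkeleyDB): a dict models the Hash method, where point lookups are O(1) but keys have no useful order, while a sorted key list with binary search models what a Btree gives you, where a range scan falls out for free.

```python
import bisect

records = {b"cherry": 3, b"apple": 1, b"banana": 2, b"durian": 4}

# Hash-style point lookup: fast, but iteration order is not sorted.
assert records[b"banana"] == 2

# Btree-style range query: keys kept sorted, slice found by binary search.
keys = sorted(records)
lo = bisect.bisect_left(keys, b"apple")
hi = bisect.bisect_right(keys, b"cherry")
in_range = [(k, records[k]) for k in keys[lo:hi]]
print(in_range)  # -> [(b'apple', 1), (b'banana', 2), (b'cherry', 3)]
```

A real Btree does this with a cursor positioned at the low key and advanced until the high key, but the principle is the same: sorted storage makes "all keys between X and Y" cheap.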
+7
hashtable b-tree berkeley-db
Nov 09 '10 at 0:12
4 answers

When your data sets get very large, B-trees still win, because most of their internal metadata can still fit in cache. Hashes, by their nature (uniform random distribution of data), are inherently cache-unfriendly. Once the total size of the data set exceeds working memory, hash performance falls off a cliff, while B-tree performance degrades gracefully (logarithmically, in fact).
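The "metadata still fits in cache" point is easy to quantify: with fan-out B, a Btree over N keys is only about ceil(log_B N) levels deep, so the interior nodes are a tiny fraction of the total data. A back-of-the-envelope check (the fan-out of 100 is an illustrative assumption, not a BerkeleyDB constant):

```python
import math

def btree_height(n_keys, fan_out):
    """Approximate number of levels in a B-tree with the given fan-out."""
    return math.ceil(math.log(n_keys, fan_out))

# A billion keys at fan-out ~100 is only a handful of levels, so every
# interior node can stay cache-resident even when the leaves cannot.
print(btree_height(10**9, 100))      # -> 5
print(btree_height(16050432, 100))   # -> 4 (the Wikipedia index below)
```

A lookup touches one node per level, so even a cold lookup on a billion-key tree costs only a few page reads, and most of those pages are the same hot interior nodes every time.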

+6
Aug 23 '12 at 1:24

It depends on your data set and keys. On small data sets your results will be about the same, but on larger ones they can vary depending on the type of keys and how much data you have. Usually a B-tree is better until its metadata exceeds your cache and it ends up doing a lot of I/O; at that point a hash is better. Also, as you pointed out, B-tree inserts are more expensive: if you find you will do many inserts and few reads, a hash may be better; if you will do few inserts but many reads, a B-tree may be better.

If you are really concerned about performance, you can try both methods and run your own tests =]
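A sketch of such a micro-benchmark in Python. A plain dict is used as a stand-in store here so the harness is self-contained; with the third-party `berkeleydb`/`bsddb3` bindings installed you would pass factories that open real Btree and Hash databases instead (those bindings are an assumption of this sketch, not exercised here).

```python
import random
import time

def bench(store_factory, keys):
    """Time bulk inserts, then point lookups, against one key-value store."""
    store = store_factory()
    t0 = time.perf_counter()
    for k in keys:
        store[k] = b"1"
    insert_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for k in keys:
        assert store[k] == b"1"
    lookup_s = time.perf_counter() - t0
    return insert_s, lookup_s

# Shuffled keys approximate the random arrival order of a live workload.
keys = [str(i).encode() for i in range(100_000)]
random.shuffle(keys)

ins, look = bench(dict, keys)
print(f"insert {ins:.3f}s  lookup {look:.3f}s")
```

Run it with your own key distribution and data sizes; as the answers above note, the interesting differences only appear once the store outgrows your caches, so test at realistic scale.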

+2
Nov 09 '10 at 0:29

For many applications a database is accessed at random, interactively, or with "transactions". That is what happens if you have data coming in from a web server. However, sometimes you have to populate a large database all at once, as a "batch" operation. That happens if you are doing a data-analysis project, or migrating an old database to a new format.

When you populate a database all at once, a B-tree or other sorted index is preferable, because it allows the batch inserts to be done much more efficiently. This is accomplished by sorting the records by key before inserting them into the database. Filling a BerkeleyDB database with 10 million entries might take an hour when the entries are unsorted, because every access is a cache miss. But when the entries are sorted, the same procedure might take only ten minutes. The proximity of consecutive keys means you will hit the various caches for almost all of the inserts. Sorting can be done very quickly, so the whole operation can be sped up several-fold by sorting the data before inserting it. With hash table indexing, because you don't know in advance which keys will end up next to each other, this optimization is not possible.
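For readers without the Perl BerkeleyDB bindings, the sort-before-load idea can be sketched with Python's stdlib sqlite3, which is also B-tree backed; this is a stand-in for BerkeleyDB, not the same engine, and on an in-memory database the gap is far smaller than the on-disk numbers below.

```python
import random
import sqlite3
import time

def load(keys):
    """Bulk-insert keys into a fresh B-tree-backed table; return (count, seconds)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (k TEXT PRIMARY KEY, v INTEGER)")
    t0 = time.perf_counter()
    conn.executemany("INSERT INTO t VALUES (?, 1)", ((k,) for k in keys))
    conn.commit()
    elapsed = time.perf_counter() - t0
    (n,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
    conn.close()
    return n, elapsed

keys = [f"{i:09d}" for i in range(100_000)]
shuffled = keys[:]
random.shuffle(shuffled)

n1, t_shuf = load(shuffled)
n2, t_sort = load(sorted(shuffled))  # sorting first gives sequential B-tree appends
assert n1 == n2 == len(keys)
print(f"shuffled: {t_shuf:.2f}s  sorted: {t_sort:.2f}s")
```

The effect becomes dramatic on disk, with data sets larger than the page cache, for the cache-locality reasons described above.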

Update: I decided to give an example. It is based on the following script, "db-test":

 #!/usr/bin/perl
 use warnings;
 use strict;
 use BerkeleyDB;

 my %hash;
 unlink "test.db";
 # First command-line argument picks the access method,
 # e.g. BerkeleyDB::Btree or BerkeleyDB::Hash.
 tie %hash, (shift), -Filename => "test.db", -Flags => DB_CREATE
     or die;
 while (<>) {          # one key per line on stdin
     $hash{$_} = 1;
 }
 untie %hash;

We can test it with the Wikipedia dump index of 16 million entries. (I am running this on an 800 MHz dual-core laptop with 3 GB of memory.)

 $ >enw.tab bunzip2 <enwiki-20151102-pages-articles-multistream-index.txt.bz2
 $ wc -l enw.tab
 16050432 enw.tab
 $ du -shL enw.tab
 698M    enw.tab
 $ time shuf enw.tab > test-shuf
 16.05s user 6.65s system 67% cpu 33.604 total
 $ time sort enw.tab > test-sort
 70.99s user 10.77s system 114% cpu 1:11.47 total
 $ time ./db-test BerkeleyDB::Btree < test-shuf
 682.75s user 368.58s system 42% cpu 40:57.92 total
 $ du -sh test.db
 1.3G    test.db
 $ time ./db-test BerkeleyDB::Btree < test-sort
 378.10s user 10.55s system 91% cpu 7:03.34 total
 $ du -sh test.db
 923M    test.db
 $ time ./db-test BerkeleyDB::Hash < test-shuf
 672.21s user 387.18s system 39% cpu 44:11.73 total
 $ du -sh test.db
 1.1G    test.db
 $ time ./db-test BerkeleyDB::Hash < test-sort
 665.94s user 376.65s system 36% cpu 46:58.66 total
 $ du -sh test.db
 1.1G    test.db

You can see that pre-sorting the Btree keys reduces the insert time from 41 minutes to 7 minutes. Sorting takes only 1 minute, so there is a big net gain: database creation is 5x faster. With the Hash format, creation times are equally slow whether the input is sorted or not. Also note that the database file is smaller for the sorted inserts; presumably this has to do with tree balancing.

The speedup must be due to some kind of caching, but I'm not sure where. We probably have fewer misses in the kernel's page cache with sorted inserts. This would be consistent with the CPU usage numbers: on a page-cache miss the process has to wait for the page to be fetched from disk, so CPU utilization is lower.

I did the same tests with two smaller files, for comparison.

 File       | WP index        | Wikt. words       | /usr/share/dict/words
 Entries    | 16e6            | 4.7e6             | 1.2e5
 Size       | 700M            | 65M               | 1.1M
 shuf time  | 34s             | 4s                | 0.06s
 sort time  | 1:10s           | 6s                | 0.12s
 ----------------------------------------------------------------------
            |        (total time, DB size, CPU usage)
 ----------------------------------------------------------------------
 Btree shuf | 41m, 1.3G, 42%  | 5:00s, 180M, 88%  | 6.4s, 3.9M, 86%
       sort |  7m, 920M, 91%  | 1:50s, 120M, 99%  | 2.9s, 2.6M, 97%
 Hash  shuf | 44m, 1.1G, 39%  | 5:30s, 129M, 87%  | 6.2s, 2.4M, 98%
       sort | 47m, 1.1G, 36%  | 5:30s, 129M, 86%  | 6.2s, 2.4M, 94%
 ----------------------------------------------------------------------
 Speedup    | 5x              | 2.7x              | 2.2x

With the largest data set, sorted inserts give us a 5x speedup. Even with the smallest we still get a 2x speedup, although the data fits easily in memory there and CPU usage is always high. This suggests we are benefiting from another source of efficiency in addition to the page cache, and that the 5x speedup was actually due in equal parts to the page cache and to something else, perhaps the better tree balancing?

In any case, I tend to prefer the Btree format, because it makes batch operations faster. Even if the final database is accessed at random, I use batch operations for development, testing, and maintenance. Life is easier if I can find a way to speed these up.

+2
Nov 16 '15 at 1:52

To quote the two main authors of Berkeley DB in their write-up of its architecture:

The main difference between Btree and Hash access methods is that Btree offers locality of reference for keys, while Hash does not. This implies that Btree is the right access method for almost all data sets; however, the Hash access method is appropriate for data sets so large that not even the Btree indexing structures fit into memory. At that point, it is better to use the memory for data than for indexing structures. This trade-off made a lot more sense in 1990, when main memory was typically much smaller than today.

So, perhaps in embedded devices and other specialized cases a hash table may still win. Btrees are used in modern file systems such as Btrfs, and they are pretty much the ideal data structure for building databases or file systems.

0
Jan 30 '15 at 21:09


