Hash table versus balanced binary tree

What factors should be considered when I need to choose between a hash table or a balanced binary tree to implement a collection or associative array?

+45
language-agnostic algorithm data-structures hash tree
Jan 31 '11 at 0:00
11 answers

This question cannot be answered, in general, I'm afraid.

The problem is that there are many types of hash tables and balanced binary trees, and their characteristics vary greatly.

So, the naive answer is: it depends on the required functionality. Use a hash table if you don't need ordering and a balanced binary tree otherwise.

For a more detailed answer, consider some alternatives.

Hash table (see Wikipedia entry for some basics)

  • Not all hash tables use a linked list as a bucket. A popular alternative is to use a “better” bucket, such as a binary tree or another hash table (with a different hash function), ...
  • Some hash tables do not use buckets at all: see the section “Open Addressing” (it comes with other problems, obviously)
  • There is something called Linear Re-Hashing (a quality-of-implementation detail) that avoids the stop-the-world-and-rehash pitfall. Basically, during the migration phase you insert only into the “new” table, and with each operation you also move one “old” entry into the “new” table. Of course, the migration phase makes every lookup a double lookup, etc.
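A minimal sketch of that incremental-migration idea (the class name, the one-entry-per-operation policy, and the growth factor are illustrative assumptions, not from the answer above):

```python
class IncrementalRehashMap:
    """Toy chained hash map that avoids stop-the-world rehashing: once the
    load factor is hit, each operation moves just one entry from the old
    table into the new, larger one."""

    def __init__(self, capacity=4):
        self.old = None                          # table being drained, if any
        self.new = [[] for _ in range(capacity)]
        self.size = 0

    def _bucket(self, table, key):
        return table[hash(key) % len(table)]

    def _migrate_one(self):
        # Move a single entry from the old table into the new one.
        if self.old is None:
            return
        for bucket in self.old:
            if bucket:
                k, v = bucket.pop()
                self._bucket(self.new, k).append((k, v))
                return
        self.old = None                          # old table fully drained

    def put(self, key, value):
        if self.old is None and self.size >= len(self.new):
            # Start a migration instead of rehashing everything at once.
            self.old, self.new = self.new, [[] for _ in range(2 * len(self.new))]
        self._migrate_one()
        bucket = self._bucket(self.new, key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def get(self, key):
        # During migration, every lookup is a double lookup: new, then old.
        tables = [self.new] if self.old is None else [self.new, self.old]
        for table in tables:
            for k, v in self._bucket(table, key):
                if k == key:
                    return v
        raise KeyError(key)
```

No single `put` ever pays the full rehash cost; the price is the double lookup while a migration is in flight.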

Binary tree

  • Rebalancing is expensive; you might consider a Skip List (also better for multi-threaded access) or a Splay Tree.
  • A good allocator can pack nodes close together in memory (better caching behavior), although this does not alleviate the pointer-chasing problem.
  • B-Trees and variants also offer such “packing”

Do not forget that O(1) is an asymptotic complexity. For few elements, the constant factor is usually more important (performance-wise). This is especially true if your hash function is slow ...

Finally, for sets, you may also consider probabilistic data structures such as Bloom Filters.
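For what a Bloom Filter buys you, here is a minimal sketch (the bit-array size, hash count, and double-hashing-via-sha256 scheme are illustrative assumptions): membership tests may return false positives but never false negatives, in exchange for very compact storage.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: 'might contain' or 'definitely not in the set'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions per item by salting one strong hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

Unlike a hash table or a tree, it cannot enumerate its members or delete them; it only answers membership queries.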

+48
Jan 31 '11 at 12:24

Hash tables are usually better if there is no need to store data in any order. Binary trees are best if the data needs to be sorted.

+39
Jan 31 '11

A point worth noting on modern architectures: a hash table, if its load factor is low, usually makes fewer memory accesses than a binary tree. Since memory access tends to be rather expensive compared to burning CPU cycles, the hash table is often faster.

The binary tree discussed below is assumed to be self-balancing, like a red-black tree, an AVL tree, or a treap.

On the other hand, when you decide to grow a hash table you need to rehash everything, which can be a costly operation when it occurs (it is amortized). Binary trees do not have this limitation.

Binary trees are easier to implement in purely functional languages.

Binary trees have a natural sort order and a natural way to traverse all elements.

When the load factor of the hash table is low, you may be wasting a lot of memory, but with their two pointers per node, binary trees tend to take up more space overall.

Hash tables are roughly amortized O(1) (depending on how you handle the load factor), and binary trees O(log n).

Trees tend to be the “average performer”. There is nothing they do particularly well, but nothing they do particularly badly either.

+11
Jan 31 '11

Hash tables can perform lookups faster, but:

  • You need a key whose hash gives a uniform distribution (otherwise you will get many collisions and have to fall back on something other than the hash, such as a linear search).
  • A hash table can waste a lot of empty space. You might reserve 256 entries but only need 8 (so far).

Binary trees:

  • Deterministic. O(log n), I think ...
  • No need for extra space as in hash tables.
  • Must be kept sorted. Adding an element into the middle means moving the rest around.
+6
Jan 31 '11

A binary search tree requires a total order relation on the keys. A hash table requires only an equivalence or identity relation on the keys, plus a consistent hash function.

If a total order relation is available, then sorted arrays offer search performance comparable to binary trees, worst-case insertion performance comparable to hash tables, and less complexity and memory use than both.

Worst-case insertion complexity for a hash table can be kept down to O(1)/O(log K) (with K the number of elements sharing the same hash) if one accepts raising worst-case search complexity to O(K), or to O(log K) if the elements can be sorted.

The invariants of trees and hash tables are expensive to restore when keys change, but still cheaper than the O(n log n) a sorted array requires.
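To illustrate the sorted-array trade-off, here is a minimal sketch of an associative array over a sorted array using Python's `bisect` module (the class name is a hypothetical choice): lookups are O(log n) binary searches, while insertions are O(n) because later elements must shift.

```python
import bisect

class SortedArrayMap:
    """Associative array backed by parallel sorted arrays: O(log n) search,
    O(n) insertion, and minimal per-element memory overhead."""

    def __init__(self):
        self.keys = []
        self.values = []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value         # overwrite existing key
        else:
            self.keys.insert(i, key)       # O(n): shifts the tail
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        raise KeyError(key)
```

It also gets ordered iteration for free, since `self.keys` is always sorted.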

These are factors to consider when deciding which implementation to use:

  • Availability of a total order relation.
  • Availability of a good hash function for the equivalence relation.
  • Prior knowledge of the number of elements.
  • Knowledge of the rates of insertion, deletion and lookup.
  • The relative cost of the comparison and hash functions.
+6
Jan 31 '11 at 18:04

If you only need to access individual elements, hash tables are better. If you need a range of elements, you simply have no option other than binary trees.
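As a concrete example of a range query over an ordered structure (here a plain sorted list stands in for the tree; the function name is a hypothetical choice), all keys in [low, high] come out in O(log n + k):

```python
import bisect

def range_query(sorted_keys, low, high):
    """Return all keys in [low, high] from a sorted sequence.
    Two binary searches find the boundaries; the slice collects the hits."""
    lo = bisect.bisect_left(sorted_keys, low)
    hi = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[lo:hi]
```

With a hash table, the only option is to scan every key and filter, which is O(n) no matter how small the range is.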

+3
Jan 31 '11

To add to the other great answers above, I would say:

Use a hash table if the amount of data will not change (for example, storing constants); but if the amount of data will change, use a tree. This is because, in a hash table, once the load factor is reached, the table must be resized. The resize operation can be very slow.

+3
Jan 31 '11 at 16:18

One issue that, it seems to me, has not been addressed is that trees are much better suited for persistent data structures, that is, immutable structures. A standard hash table (i.e., one that uses a single array of linked lists) cannot be updated without mutating the whole table. One situation in which this is relevant is when two concurrent functions both hold a copy of the hash table and one of them changes it (if the table is mutable, that change will be visible to the other as well). Another situation would be something like the following:

def bar(table):
    # some intern stuck this line of code in
    table["hello"] = "world"
    return table["the answer"]

def foo(x, y, table):
    z = bar(table)
    if "hello" in table:
        raise Exception("failed catastrophically!")
    return x + y + z

important_result = foo(1, 2, {
    "the answer": 5,
    "this table": "doesn't contain hello",
    "so it should": "be ok"
})
# catastrophic failure occurs

With a mutable table, we cannot guarantee that the table a function receives will remain the same table throughout its execution, because other function calls might change it.

So, mutability is sometimes not a nice thing. The way around this is to keep the table immutable and have updates return a new table without changing the old one. But with a hash table this is often a costly O(n) operation, since the entire underlying array must be copied. With a balanced tree, on the other hand, a new tree can be generated by creating only O(log n) new nodes (the rest of the tree is identical).

This means that an efficient tree can be very convenient when persistent maps are needed.
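A minimal sketch of that path-copying idea for an (unbalanced, for brevity) immutable BST; the function names are hypothetical, and a production version would also rebalance:

```python
from collections import namedtuple

Node = namedtuple("Node", "key value left right")

def insert(node, key, value):
    """Insert into an immutable BST, returning a new root. Only the nodes
    on the path to the key are copied; every other subtree is shared."""
    if node is None:
        return Node(key, value, None, None)
    if key < node.key:
        return node._replace(left=insert(node.left, key, value))
    if key > node.key:
        return node._replace(right=insert(node.right, key, value))
    return node._replace(value=value)      # same key: replace the value

def lookup(node, key):
    while node is not None:
        if key < node.key:
            node = node.left
        elif key > node.key:
            node = node.right
        else:
            return node.value
    raise KeyError(key)
```

After `t2 = insert(t1, k, v)`, both `t1` and `t2` remain valid, and the untouched subtrees are literally the same objects in memory, which is exactly the structure-sharing that makes persistent maps cheap.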

+2
Nov 15 '13 at 22:22

If you have several slightly different instances of sets, you probably want them to share structure. That is easy with trees (if they are immutable, or copy-on-write). I'm not sure how well you can do that with hash tables; it is at least less obvious.

+1
Jan 31 '11

In my experience, hash tables are always faster, because trees suffer from too many cache misses.

To see some real data, you can check the benchmark page of my TommyDS library: http://tommyds.sourceforge.net/

There you can see a performance comparison of the most common hash table, tree, and trie libraries.

+1
Feb 05 '11 at 16:16

One point worth noting concerns traversal and the minimum and maximum elements. Hash tables do not support any kind of ordered traversal, nor access to the minimum or maximum element. If those capabilities are important, a binary tree is the better choice.
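For concreteness, a sketch of those operations on a tiny BST represented as nested `(key, left, right)` tuples (an illustrative encoding, not a real library): in-order traversal yields the keys sorted, and the minimum is found by walking left, both without any extra sorting step.

```python
def tree_min(node):
    """Smallest key in a non-empty BST: keep descending left."""
    while node[1] is not None:
        node = node[1]
    return node[0]

def in_order(node):
    """Yield all keys in sorted order: left subtree, root, right subtree."""
    if node is None:
        return
    yield from in_order(node[1])
    yield node[0]
    yield from in_order(node[2])
```

With a hash table, the equivalent is collecting every key and sorting, an O(n log n) step each time.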

0
May 31 '16 at 6:52 a.m.


