Efficient binary heap implementation

I am looking for information on how to implement binary heaps efficiently. I feel there should be a good article somewhere on implementing heaps efficiently, but I have not found one. In fact, I have been unable to find any resources on efficient implementation beyond the basics, such as storing the heap in an array. I am looking for techniques for making a fast binary heap beyond the ones I describe below.

I have already written a C++ implementation that is faster than Microsoft Visual C++'s and GCC's std::priority_queue, and faster than using std::make_heap, std::push_heap and std::pop_heap. The following are the techniques I have already covered in my implementation. I only came up with the last two myself, though I doubt they are new ideas:

(Edit: added section on memory optimization)

Start indexes at 1
Take a look at the Wikipedia implementation notes for binary heaps. If the heap's root is placed at index 0, then the formulas for the parent, left child and right child of the node at index n are (n-1)/2, 2n+1 and 2n+2, respectively. If you use a 1-based array, the formulas become the simpler n/2, 2n and 2n+1. So the parent and left-child operations are more efficient with a 1-based array. If p points to a 0-based array and q = p - 1, then we can access p[0] as q[1], so there is no overhead in using a 1-based array.
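A minimal sketch of the trick (the names are just for illustration; note that forming p - 1 is technically undefined behavior in ISO C++, though it works on common platforms):

    #include <cstddef>

    // View a 0-based array p as a 1-based array q = p - 1, so the root
    // is q[1] and the simpler index formulas apply.
    template <typename T>
    struct OneBasedView {
        T* q;                                      // valid indexes: 1..n
        explicit OneBasedView(T* p) : q(p - 1) {}  // technically UB, see above
        static std::size_t parent(std::size_t i) { return i / 2; }
        static std::size_t left(std::size_t i)   { return 2 * i; }
        static std::size_t right(std::size_t i)  { return 2 * i + 1; }
        T& operator[](std::size_t i) { return q[i]; }
    };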

Make pop/remove move the hole to the bottom of the heap before replacing it with a leaf
Pop on a heap is often described as replacing the top element with the bottom-right leaf and then sifting it down until the heap property is restored. This requires 2 comparisons per level that we pass, and we are likely to go far down the heap because we moved a leaf to the top. So we should expect a little less than 2 log n comparisons.

Instead, we can leave a hole in the heap where the top element was. Then we move the hole down the heap by iteratively moving the larger child up. This requires only 1 comparison per level that we pass, and eventually the hole becomes a leaf. At that point we can move the bottom-right leaf into the hole's position and sift that value up until the heap property is restored. Since the value we moved was a leaf, we do not expect it to move very far up the tree. So we should expect a little more than log n comparisons, which is better than before.
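A sketch of this hole-based pop (simplified, not my tuned implementation; assumes a 1-based max-heap q[1..n] with n >= 1, compared with operator<):

    #include <cstddef>

    // Remove q[1] from the max-heap q[1..n]; afterwards the heap is q[1..n-1].
    template <typename T>
    void heapPop(T* q, std::size_t n) {
        T value = q[n];                    // save the bottom-right leaf
        --n;                               // the heap shrinks by one
        std::size_t hole = 1;
        // Phase 1: sift the hole down, 1 comparison per level.
        while (2 * hole <= n) {
            std::size_t child = 2 * hole;
            if (child + 1 <= n && q[child] < q[child + 1])
                ++child;                   // pick the larger child
            q[hole] = q[child];
            hole = child;
        }
        // Phase 2: sift the saved leaf up from the hole.
        while (hole > 1 && q[hole / 2] < value) {
            q[hole] = q[hole / 2];
            hole /= 2;
        }
        q[hole] = value;
    }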

Support replace-top
Suppose you want to remove the max element and also insert a new element. Then you can perform either of the remove/pop implementations described above, but instead of moving the bottom-right leaf, you use the new value you want to insert/push. (When most operations are of this kind, I have found that a tournament tree is better than a heap, but otherwise the heap is slightly better.)
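A sketch of replace-top along those lines (same assumptions and illustrative names as the pop above; requires n >= 1):

    #include <cstddef>

    // Replace the max element of q[1..n] with newValue. Identical to the
    // pop above, except the value sifted up is the new element rather
    // than the old bottom-right leaf, and the heap does not shrink.
    template <typename T>
    void heapReplaceTop(T* q, std::size_t n, T newValue) {
        std::size_t hole = 1;
        while (2 * hole <= n) {
            std::size_t child = 2 * hole;
            if (child + 1 <= n && q[child] < q[child + 1])
                ++child;
            q[hole] = q[child];
            hole = child;
        }
        while (hole > 1 && q[hole / 2] < newValue) {
            q[hole] = q[hole / 2];
            hole /= 2;
        }
        q[hole] = newValue;
    }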

Make sizeof(T) a power of 2
The parent, left-child and right-child formulas work on indexes, and they cannot be made to work directly on pointer values. So we are going to work with indexes, and that implies looking up the value p[i] in an array p from an index i. If p is a T* and i is an integer, then
&(p[i]) == reinterpret_cast<char*>(p) + sizeof(T) * i

and the compiler has to perform this calculation to evaluate p[i]. sizeof(T) is a compile-time constant, and the multiplication can be done much more efficiently if sizeof(T) is a power of two. My implementation got faster by adding 8 padding bytes to increase sizeof(T) from 24 to 32. The reduced cache efficiency probably means this is not a win for sufficiently large data sets.
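For example (a sketch; the field layout is made up, only the sizes matter, and the exact sizes assume a typical 64-bit platform):

    #include <cstdint>

    struct Node24 {             // sizeof == 24: offset needs a multiply by 24
        std::uint64_t key;
        std::uint64_t a, b;
    };

    struct Node32 {             // sizeof == 32: offset is index << 5
        std::uint64_t key;
        std::uint64_t a, b;
        char pad[8];            // explicit padding, 24 -> 32 bytes
    };

    static_assert(sizeof(Node32) == 32, "expected a power-of-two size");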

Pre-multiply indexes
This gave a 23% performance increase on my data set. The only thing we ever do with an index, other than finding the parent, the left child and the right child, is to look the index up in the array. So if we keep track of j = sizeof(T) * i instead of the index i, then we can look up p[i] without the multiplication that is otherwise implicit in evaluating p[i], because
 &(p[i]) == reinterpret_cast<char*>(p) + sizeof(T) * i == reinterpret_cast<char*>(p) + j

The left-child and right-child formulas for j-values then become 2*j and 2*j + sizeof(T), respectively. The parent formula is a little trickier, and I have not found a way to do it other than converting the j-value to an i-value and back, like so:

 parentOnJ(j) = parent(j/sizeof(T))*sizeof(T) == (j/(2*sizeof(T)))*sizeof(T)

If sizeof(T) is a power of 2, this compiles to 2 shifts. That is 1 operation more than the regular parent formula using indexes i, but we save 1 operation on the lookup. So the net effect is that finding the parent takes the same amount of time this way, while looking up the left child and the right child becomes faster.
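Putting the j-value navigation together (a sketch with illustrative names; base is the char* view of the array):

    #include <cstddef>

    // Navigate by byte offsets j = sizeof(T) * i instead of indexes i.
    template <typename T>
    T& atJ(char* base, std::size_t j) {    // lookup without a multiply
        return *reinterpret_cast<T*>(base + j);
    }

    template <typename T>
    std::size_t leftJ(std::size_t j)  { return 2 * j; }

    template <typename T>
    std::size_t rightJ(std::size_t j) { return 2 * j + sizeof(T); }

    template <typename T>
    std::size_t parentJ(std::size_t j) {
        // Compiles to 2 shifts when sizeof(T) is a power of two.
        return j / (2 * sizeof(T)) * sizeof(T);
    }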

Memory optimization

The answers of TokenMacGuy and templatetypedef point out memory-based optimizations that reduce cache misses. For very large data sets or rarely used priority queues, parts of the queue can be swapped out to disk by the OS. In that case it is worth adding a lot of overhead to make optimal use of the cache, because swapping in from disk is very slow. My data easily fits in memory and is in continuous use, so no part of the queue is likely to be swapped out to disk. I suspect this is the case for most uses of priority queues.

There are other priority queues designed to make better use of the CPU cache. For example, a 4-heap should have fewer cache misses, and the amount of extra overhead is not that much. LaMarca and Ladner reported in 1996 that they got a 75% performance improvement from going to aligned 4-heaps. However, Hendriks reports in 2010 that:

The improvements to the implicit heap suggested by LaMarca and Ladner [17], to improve data locality and reduce cache misses, were also tested. We implemented a four-way heap, which indeed shows a slightly better consistency than the two-way heap for very skewed input, but only for very large queue sizes. Very large queue sizes are handled better by the hierarchical heap.
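For reference, the index arithmetic of a 4-heap is just as simple (a 0-based sketch; implementations that chase cache alignment often offset the root so that each group of four siblings shares a cache line):

    #include <cstddef>

    // 0-based 4-heap: the children of node i are 4*i+1 .. 4*i+4.
    inline std::size_t parent4(std::size_t i)     { return (i - 1) / 4; }
    inline std::size_t firstChild4(std::size_t i) { return 4 * i + 1; }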

Question
Are there more techniques than these?
4 answers

An interesting paper on this subject examines the behavior of caching and paging for the overall layout of the heap; the idea is that paying for a cache miss or a page fault is far more expensive than almost any other part of a data structure's implementation. The paper discusses a heap layout that addresses this:

You're Doing It Wrong, by Poul-Henning Kamp


As an addendum to @TokenMacGuy's post, you might want to look into cache-oblivious data structures. The idea is to build data structures that, for any caching system, minimize the number of cache misses. They are tricky, but they actually might be useful from your perspective, since they perform well even when dealing with multi-layer caching systems (for example, registers / L1 / L2 / VM).

In fact, there is a paper describing an optimal cache-oblivious priority queue that may be of interest. This data structure would have all sorts of advantages in terms of speed, since it would try to minimize the number of cache misses at every level.


I don't know whether you missed this link on the wiki page for the binary heap, or decided it wasn't worth it, but anyway: http://en.wikipedia.org/wiki/B-heap


On the first point: even having a "spare spot" (index 0) in your array-based implementation is not a waste. Many operations need a temporary element anyway; instead of initializing a new element each time, it is convenient to have a dedicated element at index [0].
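Roughly, as a sketch (names are illustrative):

    #include <vector>

    // 1-based heap stored in a vector; slot 0 doubles as scratch space.
    template <typename T>
    struct Heap {
        std::vector<T> a;             // a[0] is the spare/temporary element
        Heap() : a(1) {}              // reserve the spare slot up front
        T& temp() { return a[0]; }    // reusable temporary, no fresh init
    };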
