I'm looking for information on how to implement binary heaps efficiently. I feel there should be a good article out there on implementing heaps efficiently, but I haven't found one. In fact I haven't been able to find any resources on efficient implementation beyond the basics, such as storing the heap in an array. I'm looking for techniques for making a fast binary heap beyond the ones I describe below.
I have already written a C++ implementation that is faster than Microsoft Visual C++'s and GCC's std::priority_queue, and faster than using std::make_heap, std::push_heap and std::pop_heap. The following are the techniques I have already covered in my implementation. I only came up with the last 2 myself, though I doubt that those are new ideas:
(Edit: added section on memory optimization)
Start indexes at 1

Look at the Wikipedia implementation notes for binary heaps. If the heap's root is placed at index 0, then the formulas for the parent, left-child and right-child of the node at index n are respectively (n-1)/2, 2n+1 and 2n+2. If you use a 1-based array, the formulas become the simpler n/2, 2n and 2n+1. So parent and left-child are more efficient with a 1-based array. If p points to a 0-based array and q = p - 1, then we can access p[0] as q[1], so there is no overhead in using a 1-based array.
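As a minimal sketch (the helper names are mine, not from any library), the 1-based formulas could look like this:

```cpp
#include <cstddef>

// 1-based index math: with the root at index 1, the parent/child
// formulas need no +1/-1 adjustments.
inline std::size_t parent(std::size_t n) { return n / 2; }
inline std::size_t left  (std::size_t n) { return 2 * n; }
inline std::size_t right (std::size_t n) { return 2 * n + 1; }
```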
Make pop/remove move a hole to the bottom of the heap before replacing with a leaf

Pop on a heap is frequently described by replacing the top element with the leftmost bottom leaf and then moving it down until the heap property is restored. This requires 2 comparisons per level that we go through, and we are likely to go far down the heap since we moved a leaf to the top of the heap. So we should expect a little less than 2 log n comparisons.
Instead, we can leave a hole in the heap where the top element was. Then we move that hole down the heap by iteratively moving the larger child up. This requires only 1 comparison per level that we go through. In this way the hole will become a leaf. At this point we can move the rightmost bottom leaf into the position of the hole and move that value up until the heap property is restored. Since the value we moved was a leaf, we don't expect it to move very far up the tree. So we should expect a little more than log n comparisons, which is better than before.
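A sketch of this hole-based pop on a 0-based max-heap in a std::vector (the function name is mine, not the author's exact code) could look like this:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hole-based pop: phase 1 sifts the hole left by the removed root down
// to a leaf with 1 comparison per level; phase 2 drops the last leaf
// into the hole and sifts it up. Precondition: a is a non-empty heap.
template <typename T>
void pop_heap_hole(std::vector<T>& a) {
    std::size_t n = a.size() - 1;          // index of the last leaf
    std::size_t hole = 0;                  // the removed root leaves a hole
    for (;;) {                             // phase 1: hole goes down
        std::size_t child = 2 * hole + 1;
        if (child >= n) break;
        if (child + 1 < n && a[child] < a[child + 1]) ++child;
        a[hole] = std::move(a[child]);     // promote the larger child
        hole = child;
    }
    T v = std::move(a[n]);                 // detach the last leaf
    a.pop_back();
    while (hole > 0) {                     // phase 2: leaf value goes up
        std::size_t par = (hole - 1) / 2;
        if (!(a[par] < v)) break;
        a[hole] = std::move(a[par]);
        hole = par;
    }
    if (!a.empty()) a[hole] = std::move(v); // guard: heap may now be empty
}
```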
Support replace-top

Suppose you want to remove the max element and also insert a new element. Then you can perform either of the remove/pop implementations described above, but instead of moving the rightmost bottom leaf, you use the new value that you wish to insert/push. (When most operations are of this kind I have found that a tournament tree is better than a heap, but otherwise the heap is slightly better.)
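One way to sketch replace-top (the name replace_top is mine) is the classic sift-down of the new value with early exit, which uses the new value in place of the bottom leaf:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Replace-top on a 0-based max-heap: overwrite the max with v, then
// sift v down. Precondition: a is a non-empty heap.
template <typename T>
void replace_top(std::vector<T>& a, T v) {
    std::size_t n = a.size();
    std::size_t hole = 0;                  // the old top's position
    for (;;) {
        std::size_t child = 2 * hole + 1;
        if (child >= n) break;
        if (child + 1 < n && a[child] < a[child + 1]) ++child;
        if (!(v < a[child])) break;        // new value already dominates
        a[hole] = std::move(a[child]);
        hole = child;
    }
    a[hole] = std::move(v);
}
```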
Make sizeof(T) a power of 2

The parent, left-child and right-child formulas work on indexes, and they cannot be made to work directly on pointer values. So we are going to be working with indexes, and that implies looking up values p[i] in an array p from an index i. If p is a T* and i is an integer, then
&(p[i]) == static_cast<char*>(p) + sizeof(T) * i
and the compiler has to perform this computation to get p[i]. sizeof(T) is a compile-time constant, and the multiplication can be done more efficiently if sizeof(T) is a power of two. My implementation got faster by adding 8 padding bytes to increase sizeof(T) from 24 to 32. The reduced cache efficiency probably means that this is not a win for sufficiently large data sets.
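With a hypothetical 24-byte element (not the question's actual type), the padding can be expressed as an alignment request, which rounds sizeof up to a power of two:

```cpp
#include <cstdint>

// alignas(32) forces sizeof(Padded) to round up from 24 to 32 bytes,
// trading 8 tail-padding bytes per element for a cheaper
// index-to-byte-offset multiply (a single shift).
struct alignas(32) Padded {
    double        key;          // 8 bytes
    std::uint64_t payload[2];   // 16 bytes: 24 in total, padded to 32
};
static_assert(sizeof(Padded) == 32, "expected tail padding up to 32");
static_assert((sizeof(Padded) & (sizeof(Padded) - 1)) == 0,
              "sizeof must be a power of two");
```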
Pre-multiply indexes

This was a 23% performance increase on my data set. The only thing we ever do with an index, other than finding parent, left-child and right-child, is to look the index up in an array. So if we keep track of j = sizeof(T) * i instead of the index i, then we can do a lookup p[i] without the multiplication that is otherwise implicit in evaluating p[i], because
&(p[i]) == static_cast<char*>(p) + sizeof(T) * i == static_cast<char*>(p) + j
Then the left-child and right-child formulas for j-values become respectively 2*j and 2*j + sizeof(T). The parent formula is a little trickier, and I haven't found a way to do it other than converting the j-value to an i-value and back, like so:
parentOnJ(j) = parent(j/sizeof(T))*sizeof(T) == (j/(2*sizeof(T)))*sizeof(T)
If sizeof(T) is a power of 2, then this compiles to 2 shifts. That is 1 operation more than the usual parent computation using indexes i. However, we then save 1 operation on the lookup. So the net effect is that finding the parent takes the same amount of time this way, while finding the left-child and right-child becomes faster.
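The j-value arithmetic can be sketched like this, assuming 1-based indexing as recommended above and an illustrative element size S = 32 (standing in for sizeof(T)):

```cpp
#include <cstddef>

// Pre-multiplied ("j-value") heap navigation: j = sizeof(T) * i is a
// byte offset into the array, so lookups need no multiplication.
constexpr std::size_t S = 32;  // stands in for sizeof(T), a power of 2

constexpr std::size_t leftOnJ  (std::size_t j) { return 2 * j; }
constexpr std::size_t rightOnJ (std::size_t j) { return 2 * j + S; }
constexpr std::size_t parentOnJ(std::size_t j) { return (j / (2 * S)) * S; }

// The lookup itself is then just pointer + offset, e.g.:
//   T& elem = *reinterpret_cast<T*>(reinterpret_cast<char*>(p) + j);
```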
Memory optimization

The answers of TokenMacGuy and templatetypedef point out memory-based optimizations that reduce cache misses. For very large data sets or infrequently used priority queues, parts of the queue can be swapped out to disk by the OS. In that case it is worth adding a lot of overhead to make optimal use of the cache, because swapping in from disk is very slow. My data easily fits in memory and is in continuous use, so no part of the queue is likely to be swapped out to disk. I suspect that this is the case for most uses of priority queues.
There are other priority queues that are designed to make better use of the CPU cache. For example a 4-heap should have fewer cache misses, and the amount of extra overhead is not that much. LaMarca and Ladner report in 1996 that they get a 75% performance improvement from going to aligned 4-heaps. However, Hendriks reports in 2010 that:
The improvements to the implicit heap suggested by LaMarca and Ladner [17] were also tested to improve data locality and reduce the number of cache misses. We implemented a four-way heap, which indeed shows slightly better locality than the two-way heap for very skewed input data, but only for very large queue sizes. Very large queue sizes are better handled by the hierarchical heap.
Question

Are there more techniques than these?