One way to implement HRW rendezvous hashing in O(log N)
One way to implement rendezvous hashing in O(log N) time, where N is the number of cache nodes:
Each file named F is cached on the cache node named C that has the largest weight w(F, C), as is usual with rendezvous hashing.
First, we use a non-standard weight function w(), something like this:
w(F, C) = h(F) xor h(C)
where h() is some good hash function.
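A minimal sketch of that weight function in Python; the choice of SHA-256 truncated to 64 bits for h(), and the names h, w, and HASH_BITS, are only assumptions for illustration, since any good hash works:

    import hashlib

    HASH_BITS = 64  # width of h(); an arbitrary choice for this sketch

    def h(name):
        """Hash a name to a HASH_BITS-bit integer; any good hash function will do."""
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        return int.from_bytes(digest[:HASH_BITS // 8], "big")

    def w(file_name, cache_name):
        """The xor-based weight described above: w(F, C) = h(F) xor h(C)."""
        return h(file_name) ^ h(cache_name)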
building a tree
For some file named F, instead of computing w(F, C) for every cache node, which would require O(N) time per file, we pre-compute a binary tree based only on the hashed cache-node names h(C); a tree that lets us find the cache node with the maximum value of w(F, C) in O(log N) time per file.
Each leaf of the tree holds the name C of one cache node. The root of the tree (at depth 0) points to 2 subtrees. All leaves whose h(C) has a most-significant bit of 0 are in the root's left subtree; all leaves whose h(C) has a most-significant bit of 1 are in the root's right subtree. The two children of the root (at depth 1) deal with the next-most-significant bit of h(C), and so on, with internal nodes at depth D handling the D'th-most-significant bit of h(C). With a good hash function, each step down from the root roughly halves the candidate nodes in the chosen subtree, so we get a tree of depth approximately log2(N). (If we end up with a tree that is "too unbalanced", somehow get everyone to agree on some other hash function from some universal hashing family and rebuild the tree, before adding any files to the cache, repeating until we get a tree that is "not too unbalanced".)
Once the tree has been built, we never need to change it no matter how many file names F we encounter later. We change it only when cache nodes are added to or removed from the system.
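A rough sketch of that construction, continuing the Python above; Node and build_tree are hypothetical names, and stopping as soon as one candidate remains is what keeps the depth near log2(N):

    class Node:
        """A tree node: interior nodes have children, leaves carry one cache-node name."""
        def __init__(self, name=None):
            self.left = None    # subtree whose h(C) has a 0 bit at this depth
            self.right = None   # subtree whose h(C) has a 1 bit at this depth
            self.name = name    # set only on leaves

    def build_tree(cache_names, depth=0):
        """Split the cache-node names on successive bits of h(C), most significant first."""
        if not cache_names:
            return None
        if len(cache_names) == 1 or depth >= HASH_BITS:
            # one candidate left (or a full hash collision): make a leaf
            return Node(name=cache_names[0])
        shift = HASH_BITS - 1 - depth
        node = Node()
        node.left = build_tree([c for c in cache_names if not (h(c) >> shift) & 1], depth + 1)
        node.right = build_tree([c for c in cache_names if (h(c) >> shift) & 1], depth + 1)
        return node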
file name search
For a file named F whose hash happens to be h(F) = 0 (all zero bits), we find the cache node with the highest weight (for that file name) by starting at the root and always taking the right subtree when possible. If that leads us to an interior node that has no right subtree, we take its left subtree instead. Continue until we reach a node with neither a left nor a right subtree, i.e. a leaf node, which contains the name of the chosen cache node C.
When looking up any other file named F, we first hash its name to get h(F), then start at the root and go right or left respectively (when possible) according to whether the next bit of h(F) is 0 or 1.
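A sketch of that lookup, continuing the code above. At each interior node the child whose h(C) bit is the opposite of the current bit of h(F) is preferred, because that sets the corresponding bit of the xor to 1; the other child is only a fallback:

    def find_cache_node(root, file_name):
        """Walk from the root to the leaf C that maximizes w(F, C) = h(F) xor h(C)."""
        hf = h(file_name)
        node, depth = root, 0
        while node.name is None:                         # stop when we reach a leaf
            bit = (hf >> (HASH_BITS - 1 - depth)) & 1
            preferred = node.left if bit else node.right  # child with the opposite h(C) bit
            fallback = node.right if bit else node.left
            node = preferred if preferred is not None else fallback
            depth += 1
        return node.name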
Since the tree is (by construction) not "too unbalanced", traversing the tree from the root to the leaf containing the name of the chosen cache node C takes O(log N) time in the worst case.
We expect that, for a typical set of file names, the hash h(F) picks left or right "randomly" at each level of the tree. Since the tree is (by construction) not "too unbalanced", we expect each physical cache node to end up caching roughly the same number of files (within a factor of 4 or so).
dropout effects
When some physical cache node fails, everyone deletes the corresponding leaf node from their copy of the tree. (Everyone also deletes any interior node that is then left with no leaf descendants.) This doesn't require moving any of the files cached on any other cache node; they still map to the same cache node they always did. (The leaf that had the largest weight for a given file still has the largest weight among the remaining leaves, no matter how many other nodes are deleted from the tree.)
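A sketch of that deletion, under the same assumptions as the code above:

    def remove_cache_node(node, cache_name, depth=0):
        """Delete the leaf holding cache_name and prune interior nodes that are
        left with no leaf descendants; returns the possibly changed subtree."""
        if node is None:
            return None
        if node.name is not None:                        # reached a leaf
            return None if node.name == cache_name else node
        bit = (h(cache_name) >> (HASH_BITS - 1 - depth)) & 1
        if bit:
            node.right = remove_cache_node(node.right, cache_name, depth + 1)
        else:
            node.left = remove_cache_node(node.left, cache_name, depth + 1)
        if node.left is None and node.right is None:     # no leaves remain below
            return None
        return node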
For instance,
     ....
         \
          |
         / \
        |   |
       /   / \
      |   X   |
     / \     / \
    V   W   Y   Z
Using this O(log N) algorithm, when cache node X dies, the leaf X is deleted from the tree, and all of X's files become (hopefully fairly evenly) redistributed between Y and Z; none of X's files end up on V or W or on any other cache node. All the files that previously went to cache nodes V, W, Y, Z continue to go to those same cache nodes.
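Putting the sketches together, that property can be checked directly; the exact shape of the tree for these five names depends on the hash, but the "only X's files move" invariant holds either way:

    tree = build_tree(["V", "W", "X", "Y", "Z"])
    files = ["file-%d" % i for i in range(1000)]
    before = {f: find_cache_node(tree, f) for f in files}
    tree = remove_cache_node(tree, "X")
    after = {f: find_cache_node(tree, f) for f in files}
    # every file whose cache node survived still maps to the same node
    assert all(after[f] == before[f] for f in files if before[f] != "X")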
rebalancing after dropout
Losing many cache nodes, or adding many new cache nodes, or both, can leave the tree "too unbalanced". Choosing a new hash function is a big hassle once we've added a bunch of files to the cache, so rather than picking a new hash function as we did when we first built the tree, it might be better to rebalance the tree by somehow removing a few nodes, renaming them with some new semi-random names, and then adding them back to the system. Repeat until the system is no longer too unbalanced. (Start with the most unbalanced nodes, that is, the nodes that cache the least amount of data.)
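A sketch of the "add it back" half, reusing the helpers above; how a new semi-random alias maps back to the physical node, and how to decide which nodes are "most unbalanced", are left out here:

    def add_cache_node(node, cache_name, depth=0):
        """Insert a leaf for cache_name, splitting an existing leaf at the point
        where the two hashes first diverge (build_tree does the splitting)."""
        if node is None:
            return Node(name=cache_name)
        if node.name is not None:                        # hit a leaf: split it
            return build_tree([node.name, cache_name], depth)
        bit = (h(cache_name) >> (HASH_BITS - 1 - depth)) & 1
        if bit:
            node.right = add_cache_node(node.right, cache_name, depth + 1)
        else:
            node.left = add_cache_node(node.left, cache_name, depth + 1)
        return node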
comments
ps: I think this may be pretty close to what mcdowella was thinking, but with more details to clarify that (a) yes, it is O(log N), because it is a binary tree that is "not too unbalanced", (b) it doesn't have "replicas", and (c) when one cache node fails, it doesn't require re-assigning any of the files that were not on that cache node.
pps: I am pretty sure the Wikipedia page is wrong to imply that typical rendezvous hashing implementations run in O(log N) time, where N is the number of cache nodes. It seems to me (and I suspect the original designers of rendezvous hashing would agree) that the time spent (internally, with no communication) recomputing the hash against every node in the network is so small that it's not worth worrying about compared with the time it takes to fetch data from some remote cache node.
My understanding is that rendezvous hashing is almost always implemented with a simple linear algorithm that takes O(N) time, where N is the number of cache nodes, each time we see a new file name F and want to choose the cache node for that file.
Such a linear algorithm has the advantage that it can use a "better" hash function than the xor-based w() above, so when some physical cache node dies, all of the files that were cached on the now-dead node are expected to be redistributed evenly among all the remaining nodes.
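For comparison, a sketch of that conventional O(N) pick, reusing h() from above and assuming the "better" weight simply hashes the two names together:

    def pick_cache_node_linear(cache_names, file_name):
        """Conventional rendezvous hashing: score every cache node, keep the best."""
        return max(cache_names, key=lambda c: h(file_name + "\x00" + c))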