Rendezvous hashing (HRW) in log time?

The Wikipedia page for rendezvous hashing (highest random weight, "HRW") makes the following claim:

Although it might first appear that the HRW algorithm runs in O(n) time, this is not the case. The sites can be organized hierarchically, and HRW applied at each level as one descends the hierarchy, leading to O(log n) running time, as in [7].

I did obtain a copy of the referenced paper, "Hash-based virtual hierarchies for the scalable location service on ad hoc mobile networks." However, the hierarchy described in that paper seems very specific to their application domain. As far as I can tell, there is no clear indication of how to generalize the method, while the Wikipedia remark makes it sound as though this works in general.

I looked at a few common HRW implementations, and none of them seemed to support anything better than linear time. I thought about it myself, but I don't see any way to organize the sites hierarchically without the parent nodes causing an expensive reassignment when they drop out, which defeats the main advantage of HRW.
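For concreteness, here is a minimal sketch (in Python, with an arbitrary illustrative hash; the function names are mine, not from any particular library) of the flat, linear-time selection that those implementations do:

    import hashlib

    def weight(key: str, site: str) -> int:
        # Combined hash of key and site; any good hash function will do here.
        return int.from_bytes(hashlib.sha256(f"{key}|{site}".encode()).digest(), "big")

    def hrw_select(key: str, sites: list[str]) -> str:
        # Plain rendezvous hashing: score every site and take the maximum -> O(N).
        return max(sites, key=lambda site: weight(key, site))

    # e.g. hrw_select("some-file", ["site-a", "site-b", "site-c"])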

Does anyone know how to do this? Alternatively, is Wikipedia wrong in claiming that there is a general way to implement this in log time?

Edit: evaluating mcdowella's approach:

OK, I think I see how this could work. But you need a little more than you've said.

If you just do what you've described, you're in a situation where each leaf probably has only zero or one site in it, and there is significant variance in how many sites end up in the leaf-most subtrees. If you swap the per-level HRW for just doing the whole thing with an ordinary search tree, you get exactly the same effect. Essentially, you have an implementation of consistent hashing, along with its flaw of unequal load between buckets. Computing the combined weights, which is what defines an HRW implementation, adds nothing; you're better off just doing a search at each level, since that saves the hashing and can be implemented without looping over each radix value.

This can be fixed, though: you just need to use HRW to choose among many alternatives at the final level. That is, you need all the leaf nodes to sit in sizable buckets, comparable to the number of replicas you would have in consistent hashing. Those buckets should be approximately evenly loaded relative to each other, and then you use HRW to pick the specific site within the bucket. Since the bucket sizes are fixed, that final selection is O(1), and we keep the key properties of HRW.
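A rough sketch of that hybrid, under assumptions of my own (a plain modulo index stands in for the "something quick" upper level, and the bucket count is chosen small enough that buckets stay sizable):

    import hashlib

    def h(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")

    def build_buckets(sites: list[str], num_buckets: int) -> list[list[str]]:
        # Fixed, hash-based assignment of sites to buckets; with far fewer
        # buckets than sites, the buckets stay roughly equally sized.
        buckets = [[] for _ in range(num_buckets)]
        for site in sites:
            buckets[h(site) % num_buckets].append(site)
        return buckets

    def select(key: str, buckets: list[list[str]]) -> str:
        # O(1) indexing step to pick a bucket, then HRW only within that
        # fixed-size bucket, so HRW's reassignment behaviour is preserved
        # inside each bucket.
        bucket = buckets[h(key) % len(buckets)]
        return max(bucket, key=lambda site: h(key + "|" + site))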

Honestly, though, I think this is pretty dubious. It isn't so much an implementation of HRW as a simple combination of HRW with consistent hashing. I suppose there's nothing wrong with that, and it might even be better than the usual replica technique in some cases. But I find it misleading to call HRW log(n) if this is really what the author had in mind.

Also, the original description is questionable. You don't need to apply HRW at every level, and you shouldn't, since there is no advantage in doing so; you should do something fast (like indexing) and use HRW only for the final selection.

Is that really the best we can do, or is there some other way to make HRW O(log n)?

4 answers

If you give each site a sufficiently long random identifier expressed in radix k (perhaps by hashing a non-random id), you can associate the sites with the leaves of a tree that has at most k descendants at each node. There is no need to associate any site with an internal node of the tree.

To work out where to store an item, use HRW to decide, starting from the root of the tree, which way to branch at each tree node, stopping when you reach a leaf, which is associated with a site. You can do this without contacting any site until you have worked out which site you want to store the item at - all you need in order to build the tree is the hashed site identifiers.

Since sites are associated only with leaves, there is no way for an internal node of the tree to drop out unless all the sites associated with the leaves below it drop out, at which point it becomes irrelevant.
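A sketch of how I read this answer (my own naming and data layout: a dict-of-dicts trie keyed by radix digits, with a salted hash standing in for the HRW weight of each branch):

    import hashlib

    def h(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")

    def insert(tree: dict, site_id: str, site: str) -> None:
        # Place a site at the leaf addressed by its radix-k id (the digits of
        # site_id; distinct, equal-length ids assumed).
        node = tree
        for digit in site_id:
            node = node.setdefault(digit, {})
        node["site"] = site

    def locate(tree: dict, item: str) -> str:
        # Descend from the root; at each level use HRW to pick among the
        # branches that actually lead to at least one site.
        node, prefix = tree, ""
        while "site" not in node:
            digit = max(
                (d for d in node if d != "site"),
                key=lambda d: h(item + "|" + prefix + d),  # weight(item, branch)
            )
            prefix += digit
            node = node[digit]
        return node["site"]

Here site_id would be the site's hashed identifier written out in radix-k digits (for k = 2, a fixed-length bit string). Each level examines at most k branch weights, and k is a constant, so the whole descent costs O(log n) weight evaluations.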


I don't buy the updated answer. There are two nice properties of HRW that seem to get lost when you compare the weights of branches instead of all sites.

First, you can pick the top-n sites, not just the primary one, and those should be randomly distributed. If you descend a single tree, the top-n sites will be near each other in the tree. This could be fixed by descending multiple times with different salts, but that seems like a lot of extra work.

Second, it is obvious what happens when a site is added or removed, and only 1/|sites| of the data moves in the case of an add. If you modify an existing tree, only the peer site is affected. In the case of an add, the only data that moves comes from the new peer of the added site. In the case of a removal, all the data that was on that site moves to its former peer. If you rebuild the tree instead, all of the data may move, depending on how you build the tree.
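For comparison, the first property is straightforward with the flat O(N) formulation; a small sketch (again with an arbitrary illustrative hash of my own choosing):

    import hashlib

    def weight(key: str, site: str) -> int:
        return int.from_bytes(hashlib.sha256(f"{key}|{site}".encode()).digest(), "big")

    def hrw_top_n(key: str, sites: list[str], n: int) -> list[str]:
        # The n highest-weight sites for this key; flat HRW spreads them
        # uniformly, with no dependence on any tree layout.
        return sorted(sites, key=lambda s: weight(key, s), reverse=True)[:n]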


I think you can use the same "virtual node" approach that is commonly used for consistent hashing. Suppose you have N physical nodes with identifiers:

{n1,...,nN}. 

Pick V, the number of virtual nodes per physical node, and generate a new list of identifiers:

 {n1v1, n1v2, ..., n1vV, n2v1, n2v2, ..., n2vV, ..., nNv1, nNv2, ..., nNvV}. 

Arrange these as the leaves of a fixed but randomized binary tree with labels on the internal nodes. An internal label can be, for example, the concatenation of the labels of its child nodes.

To choose the physical node at which to store an object O, start at the root and take the branch with the higher hash H(label, O). Repeat until you reach a leaf. Store the object at the physical node corresponding to the virtual node at that leaf. This takes O(log(NV)) = O(log N + log V) = O(log N) steps (since V is constant).

If a physical node fails, the objects on that node are rehashed, skipping the subtrees that have no active leaves.
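A sketch of how this might look, with details of my own filled in (a shared random seed so that everyone builds the same "fixed but randomized" tree, and SHA-256 standing in for H):

    import hashlib
    import random

    def H(label: str, obj: str) -> int:
        return int.from_bytes(hashlib.sha256(f"{label}|{obj}".encode()).digest(), "big")

    def build_tree(virtual_nodes: list[str]):
        # Fixed but randomized binary tree; an internal node's label is the
        # concatenation of its children's labels, as suggested above.
        leaves = sorted(virtual_nodes)
        random.Random(42).shuffle(leaves)   # shared seed (assumption): every
                                            # participant builds the same tree
        level = [(leaf, None, None) for leaf in leaves]   # (label, left, right)
        while len(level) > 1:
            nxt = []
            for i in range(0, len(level) - 1, 2):
                left, right = level[i], level[i + 1]
                nxt.append((left[0] + right[0], left, right))
            if len(level) % 2:
                nxt.append(level[-1])       # odd one out is carried upward
            level = nxt
        return level[0]

    def select(tree, obj: str) -> str:
        # Descend, always taking the child whose label hashes higher with O.
        label, left, right = tree
        while left is not None:
            label, left, right = left if H(left[0], obj) >= H(right[0], obj) else right
        return label   # virtual node id such as "n3v7"; map it back to "n3"

    # e.g. tree = build_tree([f"n{i}v{j}" for i in range(1, 4) for j in range(1, 3)])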


One way to implement HRW rendezvous hashing in log time

One way to implement rendezvous hashing in O(log N), where N is the number of cache nodes:

Each file named F is cached on the cache node named C with the largest weight w(F, C), as usual in rendezvous hashing.

First, we use a non-standard hash function w(), something like this:

w(F, C) = h(F) xor h(C),

where h() is some good hash function.

building a tree

For a given file named F, instead of computing w(F, C) for every cache node - which takes O(N) time per file - we pre-compute a binary tree based only on the hashed names h(C) of the cache nodes; a tree that lets us find the cache node with the maximum value of w(F, C) in O(log N) time per file.

Each leaf of the tree contains the name C of one cache node. The root of the tree (at depth 0) points to 2 subtrees. All the leaves where the most significant bit of h(C) is 0 are in the root's left subtree; all the leaves where the most significant bit of h(C) is 1 are in the root's right subtree. The two children of the root (at depth 1) deal with the next-most-significant bit of h(C), and so on, with the internal nodes at depth D dealing with the D'th-most-significant bit of h(C). With a good hash function, each step down from the root roughly halves the candidate cache nodes in the chosen subtree, so we end up with a tree of depth roughly log2(N). (If we end up with a tree that is "too unbalanced", somehow get everyone to agree on some other hash function from some universal hashing family, and rebuild the tree before adding any files to the cache, until we get a tree that is "not too unbalanced".)

Once the tree has been built, we never need to change it, no matter how many file names F we encounter later. We change it only when cache nodes are added to or removed from the system.
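One way this construction might look in code, under representation choices of my own (a dict-based binary trie keyed by the bits of h(C), with SHA-256 truncated to a fixed width standing in for h; distinct h(C) values assumed):

    import hashlib

    BITS = 32   # how many bits of h(C) we use for positions in the tree

    def h(name: str) -> int:
        return int.from_bytes(hashlib.sha256(name.encode()).digest()[:4], "big")

    def _build(entries: list, depth: int):
        # entries is a list of (h(C), C); split on the bit at this depth.
        if len(entries) == 1:
            return entries[0][1]                    # leaf: one cache node name
        node = {}
        for bit in (0, 1):
            side = [e for e in entries
                    if ((e[0] >> (BITS - 1 - depth)) & 1) == bit]
            if side:
                node[bit] = _build(side, depth + 1)
        return node

    def build_tree(cache_nodes: list[str]):
        # Depends only on the hashed cache node names, not on any file name.
        return _build([(h(c), c) for c in cache_nodes], 0)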

file name search

For a file name F that happens to hash to h(F) = 0 (all zero bits), we find the cache node with the highest weight (for that file name) by starting at the root and always taking the right subtree whenever possible. If that leads us to an internal node that has no right subtree, we take its left subtree instead. Continue until we reach a node with no left or right subtree, i.e. a leaf node, which contains the name of the selected cache node C.

When looking up any other file name F, we first hash the name to get h(F), then start at the root and, at each level, go right if the next bit of h(F) is 0 and left if it is 1 (falling back to the other subtree when the preferred one does not exist), since that is the choice that maximizes h(F) xor h(C).

Since the tree is (by construction) not "too unbalanced", traversing it from the root to the leaf that contains the name of the chosen cache node C takes O(log N) time in the worst case.

We expect that, for a typical set of file names, the hash function h(F) picks left or right roughly at random at each depth of the tree. Since the tree is (by construction) not "too unbalanced", we expect each physical cache node to cache roughly the same number of files (within a factor of 4 or so).
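Continuing the trie sketch above (same h, BITS, and tree layout), the lookup is just a walk down the tree guided by the bits of h(F):

    def lookup(tree, filename: str) -> str:
        # At depth D, prefer the child whose bit is the complement of the
        # D'th most significant bit of h(F), which greedily maximizes
        # h(F) xor h(C); fall back to the other child if the preferred
        # subtree does not exist.
        node, bits, depth = tree, h(filename), 0
        while not isinstance(node, str):            # leaves are plain names
            want = 1 - ((bits >> (BITS - 1 - depth)) & 1)
            node = node[want] if want in node else node[1 - want]
            depth += 1
        return node                                 # the chosen cache node C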

drop-out effects

When some physical cache node fails, everyone deletes the corresponding leaf node from their copy of this tree. (Everyone also deletes every interior node that then has no leaf descendants.) This does not require moving any of the files cached on any other cache node - they still map to the same cache node they always did. (The right-most leaf node in the tree is still the largest leaf node in that tree, no matter how many other nodes in the tree are deleted.)

For instance,

             ....
            /    \
           .      .
          / \    / \
         V   W  X   .
                   / \
                  Y   Z

With this O(log N) algorithm, when cache node X dies, the leaf X is deleted from the tree and all of its files become (hopefully relatively evenly) distributed between Y and Z - none of the files from X end up on V or W or on any other cache node. All files that previously went to cache nodes V, W, Y, Z continue to go to those same cache nodes.
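A sketch of the removal step, again continuing the trie sketch above (a leaf is found by following the failed node's own hash bits; at least two cache nodes are assumed to be in the tree):

    def remove(tree: dict, cache_node: str) -> None:
        # Everyone applies this to their own copy of the tree; files cached
        # on other nodes keep mapping to exactly the nodes they did before.
        bits, depth, node, path = h(cache_node), 0, tree, []
        while True:
            bit = (bits >> (BITS - 1 - depth)) & 1
            path.append((node, bit))
            if isinstance(node[bit], str):          # reached this node's leaf
                break
            node, depth = node[bit], depth + 1
        for parent, bit in reversed(path):
            del parent[bit]                         # drop the leaf, then any
            if parent:                              # interior node emptied by
                break                               # the deletion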

rebalancing after drop-outs

Lots of cache nodes dropping out, or lots of new cache nodes being added, or both, can leave the tree "too unbalanced". Picking a new hash function is a big hassle after we have added a bunch of files to the cache, so instead of picking a new hash function, as we did when we first built the tree, it might be better to rebalance the tree by removing a few nodes, renaming them with some new semi-random names, and then adding them back to the system. Repeat until the system is no longer too unbalanced. (Start with the most unbalanced nodes - the nodes caching the least amount of data.)

comments

ps: I think this may be pretty close to what mcdowella was thinking, but with more details filled in to clarify that (a) yes, it is log(N), because it is a binary tree that is "not too unbalanced", (b) it has no "replicas", and (c) when one cache node fails, it does not require remapping any files that were not on that cache node.

pps: I am pretty sure the Wikipedia page is wrong to imply that typical rendezvous hashing implementations run in O(log N) time, where N is the number of cache nodes. It seems to me (and I suspect to the original designers of the hash as well) that the time it takes to (internally, without any communication) recompute the hash against every node in the network is going to be negligible and not worth worrying about compared to the time it takes to fetch data from some remote cache node.

My understanding is that rendezvous hashing is almost always implemented with a simple linear algorithm that uses O(N) time, where N is the number of cache nodes, every time we get a new file name F and want to choose the cache node for that file.

Such a linear algorithm has the advantage that it can use a "better" hash function than the xor-based w() above, so when some physical cache node dies, the files that were cached on the now-dead node are expected to end up evenly distributed among all the remaining nodes.


Source: https://habr.com/ru/post/1208942/

