Tree structure hashing

I just met a script in my project where I need to compare different tree objects for equality with already known instances and thought that some kind of hashing algorithm that works on an arbitrary tree would be very useful.

Take, for example, the following tree:

         O
        / \
       / \
      Oo
     / | \ |
    / |  \ | |
   OOOO
           / \
          / \
         Oo

Where each O represents a tree node, is an arbitrary object, has an associated hash function. Thus, the problem boils down to: considering the hash code of the nodes of the tree structure and the known structure, what is a worthy algorithm for calculating the (relatively) collision hash code for the whole tree?

A few notes about the properties of the hash function:

  • The hash function should depend on the hash code of each node within the tree, as well as on its position.
  • Rearranging the children of a node should clearly change the resulting hash code.
  • Reflection of any part of the tree should clearly change the resulting hash code

If this helps, I use C # 4.0 here in my project, although I am primarily looking for a theoretical solution, so a pseudocode, description or code in another imperative language will be fine.




UPDATE

Well, here is my own suggested solution. A few of these answers were very helpful.

Each node (subtree / leaf node) has the following hash function:

 public override int GetHashCode() { int hashCode = unchecked((this.Symbol.GetHashCode() * 31 + this.Value.GetHashCode())); for (int i = 0; i < this.Children.Count; i++) hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode()); return hashCode; } 

The good thing about this method, as I see it, is that hash codes can only be cached and recounted when a node or one of its descendants changes. (Thanks to Batting and Jason Orendorf for this).

In any case, I would be grateful if people could comment on my proposed solution here - if it does a good job, it’s great, otherwise any possible improvements would be welcome.

+41
algorithm data-structures hash tree
Jan 01 '09 at 2:00
source share
11 answers

If I did this, I would probably do something like the following:

For each node sheet, calculate the concatenation of 0 and the hash of the node data.

For each internal node, calculate concatenation 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.

This will lead to a cascade of the tree every time something changes, but it MAY be low enough for the overhead to be useful. If changes are relatively infrequent compared to the number of changes, it may even be appropriate to use a cryptographically secure hash.

Edit1: it is also possible to add a hash valid flag for each node and simply propagate false to the tree (or hash invalid and propagate true) up the tree to change the node. Thus, it may be possible to avoid a full recount when a hash hash is required, and it is possible to avoid numerous hash calculations that are not used, with the risk of slightly less than the predicted time to obtain the hash when necessary.

Edit3: the hash code proposed by Noldorin in the question looks like he will have a chance of collision if the result of GetHashCode can ever be 0. In essence, there is no way to distinguish a tree consisting of one node with a hash of 30 and a "value hash "25 and a tree with two node characters, where the root has a" character hash "of 0 and a" hash value "of 30, and the child element of the node has a common hash of 25. The examples are completely invented, I don't know what expected hash ranges I can just comment on what I see in the code presented.

Using 31 as a multiplicative constant is good because it will cause any overflow on the non-bit boundary, although I think that with enough children and possibly content in the tree, the hash input from the elements hashing early MAY dominate the later hash elements .

However, if the hash works decently on the expected data, it looks like it will do the job. This, of course, is faster than using a cryptographic hash (as is done in the code example below).

Edit2: Regarding specific algorithms and a minimal data structure, something like the following (Python, translation into any other language should be relatively simple).

 #!  / usr / bin / env python

 import Crypto.Hash.SHA

 class Node:
     def __init__ (self, parent = None, contents = "", children = []):
         self.valid = False
         self.hash = False
         self.contents = contents
         self.children = children


     def append_child (self, child):
         self.children.append (child)

         self.invalidate ()

     def invalidate (self):
         self.valid = False
         if self.parent:
             self.parent.invalidate ()

     def gethash (self):
         if self.valid:
             return self.hash

         digester = crypto.hash.SHA.new ()

         digester.update (self.contents)

         if self.children:
             for child in self.children:
                 digester.update (child.gethash ())
             self.hash = "1" + digester.hexdigest ()
         else:
             self.hash = "0" + digester.hexdigest ()

         return self.hash

     def setcontents (self):
         self.valid = False
         return self.contents
+23
Jan 02 '09 at 11:57
source share

Well, after your editing, where you introduced the requirement that the hash result should be different for different tree layouts, you just have to leave the option to traverse the whole tree and write its structure to one array.

This is done as follows: you cross the tree and perform the operations that you perform. For the source tree, which could be (for the structure on the left and right):

 [1, child, 2, child, 3, sibling, 4, sibling, 5, parent, parent, //we're at root again sibling, 6, child, 7, child, 8, sibling, 9, parent, parent] 

You can then assign the list (i.e., actually a string) as you like. As another option, you can even return this list as a result of a hash function, so it becomes a tree without collisions.

But adding accurate information about the entire structure does not mean that hash functions usually function. The proposed method is to calculate the hash function of each node, as well as traverse the entire tree. Therefore, you can consider other hashing methods described below.




If you do not want to move around the tree:

One of the algorithms that immediately crossed my mind is similar to this. Choose a large prime number H (greater than the maximum number of children). To have a hash tree, hash its root, select the child number H mod n , where n is the number of root children and recursively hash the subtree of this child.

This seems to be a bad option if the trees differ only deep in the leaves. But at least he should run fast after not very tall trees.

If you want a hash of fewer elements, but go through the whole tree :

Instead of hashing the subtree, you might want to use a hash layer. That is, the hash root, and not one of the nodes that are its children, then one of the children of the children, etc. This way you cover the whole tree instead of one of the specific paths. This makes the hash process slower, of course.

  --- O ------- layer 0, n=1 / \ / \ --- O --- O ----- layer 1, n=2 /|\ | / | \ | / | \ | O - O - O O------ layer 2, n=4 / \ / \ ------ O --- O -- layer 3, n=2 

A node from the layer is selected with the rule H mod n .

The difference between this version and the previous version is that the tree has to go through a rather illogical transformation in order to save the hash function.

+8
Jan 01 '09 at 14:14
source share

The usual hashing technique of any sequence combines the values ​​(or hashes) of its elements in some mathematical way. I do not think that in this respect the tree will be different.

For example, here is a hash function for tuples in Python (taken from Object / tupleobject.c in a Python 2.6 source):

 static long tuplehash(PyTupleObject *v) { register long x, y; register Py_ssize_t len = Py_SIZE(v); register PyObject **p; long mult = 1000003L; x = 0x345678L; p = v->ob_item; while (--len >= 0) { y = PyObject_Hash(*p++); if (y == -1) return -1; x = (x ^ y) * mult; /* the cast might truncate len; that doesn't change hash stability */ mult += (long)(82520L + len + len); } x += 97531L; if (x == -1) x = -2; return x; } 

This is a relatively complex combination with constants experimentally selected to obtain the best results for tuples of typical lengths. What I'm trying to show with this code snippet is that the problem is very complex and very heuristic, and the quality of the results probably depends on more specific aspects of your data - that is, domain knowledge can help you achieve better results. However, for good enough results you should not look too far. I would suggest that using this algorithm and combining all the tree nodes instead of all the tuple elements plus adding their position to the game will give you a pretty good algorithm.

One option for position accounting is the default position of the node in the camp tree.

+7
Jan 01 '09 at 14:20
source share

Every time you work with tree recursion, come to mind:

 public override int GetHashCode() { int hash = 5381; foreach(var node in this.BreadthFirstTraversal()) { hash = 33 * hash + node.GetHashCode(); } } 

The hash function should depend on the hash code of each node within the tree, as well as on its position.

Check We explicitly use node.GetHashCode() when computing the tree hash code. In addition, due to the nature of the algorithm, the node position plays a role in the final hash of the tree.

Rearranging the children of a node should clearly change the resulting hash code.

Check They will be visited in a different order in a workaround leading to a different hash code. (Note: if there are two children with the same hash code, you will receive the same hash code when changing the order of these children.)

Reflecting any part of the tree should explicitly change the resulting hash code.

Check Again, the nodes will be visited in a different order, which will lead to a different hash code. (Note that there are situations where reflection can lead to the same hash code if each node is reflected in a node with the same hash code.)

+6
Jan 02 '09 at 3:58
source share

A conflict-free property for this will depend on how merciless the hash function used for the node data.

It looks like you want a system in which the hash of a particular node is a combination of node hashes for children, where order matters.

If you plan to manipulate this tree a lot, you can pay a price in the hash code storage space with each node to avoid a penalty for recounting when performing operations on the tree.

Since the order of the child nodes matters, a method that could work here would be to combine the data and the children of the node using multiple numbers and adding some large amount modulo.

To find something similar to the Java String hash code:

Say you have n child nodes.

 hash(node) = hash(nodedata) + hash(childnode[0]) * 31^(n-1) + hash(childnode[1]) * 31^(n-2) + <...> + hash(childnode[n]) 

More information on the above scheme can be found here: http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

+4
Jan 02 '09 at 13:09
source share

I see that if you have a large set of trees for comparison, you can use the hash function to extract a set of potential candidates, and then do a direct comparison.

The substring that will work just uses the lisp syntax to place the brackets around the tree, write down the identifier of each node in preliminary order. But this is computationally equivalent to pre-matching the tree, so why not just do it?

I gave two solutions: one for comparing two trees when you finished (needed to resolve conflicts), and the other for computing a hash code.

COMPARISON OF TREE:

The most efficient way to compare is to simply recursively move each tree in a fixed order (the preliminary order is simple and no worse than others), comparing the node at each step.

  • So, just create a visitor template that sequentially returns the next node in pre-order for the tree. those. the constructor can take the root of the tree.

  • Then just create two visitor inserts that act as generators for the next node in the preorder. those. Vistor v1 = new visitor (root1), visitor v2 = new visitor (root2)

  • Write a comparison function that can compare with another node.

  • Then just visit each node of the trees, comparing and returning false if the comparison fails. i.e.

Module

  Function Compare(Node root1, Node root2) Visitor v1 = new Visitor(root1) Visitor v2 = new Visitor(root2) loop Node n1 = v1.next Node n2 = v2.next if (n1 == null) and (n2 == null) then return true if (n1 == null) or (n2 == null) then return false if n1.compare(n2) != 0 then return false end loop // unreachable End Function 

End module

GENERATION OF CHARACTERISTIC CODE:

if you want to write a string representation of a tree, you can use the lisp syntax for the tree, and then the sample string to generate a shorter hash code.

Module

  Function TreeToString(Node n1) : String if node == null return "" String s1 = "(" + n1.toString() for each child of n1 s1 = TreeToString(child) return s1 + ")" End Function 

node.toString () can return a unique label / hash / whatever code for this node. You can then simply compare the substring with the strings returned by the TreeToString function to determine if the trees are equivalent. For a shorter hash code, just select the TreeToString function, i.e. Take every 5 characters.

End module

+3
Jan 01 '09 at 14:48
source share

I think you could do it recursively: suppose you have a hash function h that hashes strings of arbitrary length (e.g. SHA-1). Now the tree hash is the hash of the string created as a concatenation of the hash of the current element (for this you have your own function) and the hashes of all the child elements of this node (from recursive function calls).

For a binary tree you should:

Hash( h(node->data) || Hash(node->left) || Hash(node->right) )

You may need to carefully check the correctness of the tree geometry. I think that with some effort, you could get a method for which collision detection for such trees can be as complex as collision detection in the main hash function.

+1
Jan 01 '09 at 14:40
source share

A simple enumeration (in any deterministic order) along with a hash function that depends on visiting the node visitor should work.

 int hash(Node root) { ArrayList<Node> worklist = new ArrayList<Node>(); worklist.add(root); int h = 0; int n = 0; while (!worklist.isEmpty()) { Node x = worklist.remove(worklist.size() - 1); worklist.addAll(x.children()); h ^= place_hash(x.hash(), n); n++; } return h; } int place_hash(int hash, int place) { return (Integer.toString(hash) + "_" + Integer.toString(place)).hash(); } 
+1
Jan 02 '09 at 3:10
source share
 class TreeNode { public static QualityAgainstPerformance = 3; // tune this for your needs public static PositionMarkConstan = 23498735; // just anything public object TargetObject; // this is a subject of this TreeNode, which has to add it hashcode; IEnumerable<TreeNode> GetChildParticipiants() { yield return this; foreach(var child in Children) { yield return child; foreach(var grandchild in child.GetParticipiants() ) yield return grandchild; } IEnumerable<TreeNode> GetParentParticipiants() { TreeNode parent = Parent; do yield return parent; while( ( parent = parent.Parent ) != null ); } public override int GetHashcode() { int computed = 0; var nodesToCombine = (Parent != null ? Parent : this).GetChildParticipiants() .Take(QualityAgainstPerformance/2) .Concat(GetParentParticipiants().Take(QualityAgainstPerformance/2)); foreach(var node in nodesToCombine) { if ( node.ReferenceEquals(this) ) computed = AddToMix(computed, PositionMarkConstant ); computed = AddToMix(computed, node.GetPositionInParent()); computed = AddToMix(computed, node.TargetObject.GetHashCode()); } return computed; } } 

AddToTheMix is ​​a function that combines two hash codes, so the sequence matters. I do not know what it is, but you can understand. You know a bit of offset, rounding, ...

The idea is that you need to analyze some node environment, depending on the quality you want to achieve.

0
Jan 01 '09 at 15:14
source share

I must say that your requirements are somewhat contrary to the whole concept of hash codes.

The complexity of computing a hash function must be very limited.

This computational complexity should not linearly depend on the size of the container (tree), otherwise it completely destroys hash-based algorithms.

Considering a position as the main property of a hash function of nodes also contradicts the concept of a tree, but is achievable if you replace the requirement that it should depend on the position.

The general principle that I would suggest replaces the MUST requirements with the SHOULD requirements. Thus, you can find a suitable and efficient algorithm.

For example, consider creating a limited sequence of whole hashcode tokens and add what you want to this sequence, in order of preference.

The order of the elements in this sequence is important; it affects the calculated value.

for example, for each node you want to calculate:

  • add hash code of base object
  • add hash codes of the base objects of the closest siblings, if available. I think even one left brother would be enough.
  • add the hash code of the base object of the parent and the next siblings, as for the node itself, as well as 2.
  • repeat this with grandparents at a limited depth.

     //--------5------- ancestor depth 2 and it left sibling; //-------/|------- ; //------4-3------- ancestor depth 1 and it left sibling; //-------/|------- ; //------2-1------- this; 

    the fact that you add the hash code of the direct source object associated with the site gives the property of positioning the hash function.

    If this is not enough, add children: you must add each child, only a few, to give a decent hash code.

  • add the first child and the first child and first child .. limit the depth to some constant and do not calculate anything recursively - only the base node hash code.

     //----- this; //-----/--; //----6---; //---/--; //--7---; 

Thus, complexity is linear with respect to the depth of the base tree, and not with the total number of elements.

Now you have a sequence, if integers, combine them with a well-known algorithm, as Eli suggests above.

1,2, ... 7

Thus, you will have a lightweight hash function with a positional property that does not depend on the total size of the tree and does not even depend on the depth of the tree, and does not require recalculation of the hash function of the whole tree when you change the tree structure.

I bet that these 7 numbers will give a hash sacrifice along with perfection.

0
Jan 01 '09 at 16:18
source share

Writing your own hash function is almost always a mistake, because you basically need a degree in mathematics to do it well. Hashfunctions .

- - -. node -. - . , .

0
07 . '10 6:04
source share



All Articles