Below is an O(n log n) algorithm that works for graphs in which each child has at most one parent (EDIT: this algorithm does not extend to the two-parent case with O(n log n) performance). It is worth noting that I believe the bound can be improved to O(n log(maximum level label)) with additional work.
One-parent case:
For each node x, in reverse topological order, build a binary search tree T_x whose entries strictly increase in both birth date and number of generations removed from x. (T_x contains the earliest-born person c1 in the subgraph of the pedigree graph rooted at x, together with the next-earliest-born person c2 in that subgraph whose level, i.e. number of generations removed from x, is strictly greater than that of c1, together with the next-earliest-born person c3 in that subgraph whose level is strictly greater than that of c2, and so on.) To build T_x, we combine the previously constructed trees T_w, where w ranges over the children of x (those trees already exist because we iterate in reverse topological order).
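To make the invariant concrete, here is a minimal list-based sketch of what T_x holds: from the (birth date, level) pairs of the descendants of x, it keeps the "staircase" that strictly increases in both coordinates. A real implementation would keep this in a balanced BST; the function name `staircase` and the plain sorted list are illustrative assumptions.

```python
def staircase(pairs):
    """Given (birth_date, generations_removed_from_x) pairs, return the
    chain that is strictly increasing in both birth date and level,
    as maintained by T_x.  List stand-in for the balanced BST."""
    chain = []
    for birth, level in sorted(pairs):
        # keep an entry only if its level strictly exceeds the last kept one
        if not chain or level > chain[-1][1]:
            chain.append((birth, level))
    return chain
```

Note that the result has at most one entry per level, which is the fact the cost analysis below relies on.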
If we are careful about how we perform these merges, we can show that the total cost of all merges across the entire ancestry graph is O(n log n). The key observation is that after each merge, the combined tree retains at most one node per level. We assign each tree T_w a potential of (log n) · h(w), where h(w) is the length of the longest path from w down to a leaf of the pedigree graph.
When we combine the child trees T_w to create T_x, we "destroy" all of the trees T_w, releasing all of the potential they store to pay for building T_x, and we create the new tree T_x with potential (log n) · h(x). Our goal is therefore to spend at most O((log n) · (sum_w h(w) − h(x) + 1)) time creating T_x from the trees T_w, so that the amortized cost of the merge is only O(log n). We achieve this by choosing the tree T_w with maximum h(w) as the starting point for T_x, and then modifying that T_w into T_x. Once this choice is made for T_x, we fold each of the remaining trees into T_x one by one, using an algorithm similar to the standard algorithm for merging two binary search trees.
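The accounting in the last two paragraphs can be written out in one line. With the potential function Φ = Σ over live trees (log n) · h(w), the amortized cost of building T_x is:

```latex
% Amortized cost of one merge step under \Phi = \sum_{\text{live } T_w} (\log n)\, h(w)
\begin{align*}
\text{amortized}(x)
  &= \underbrace{O\!\Big((\log n)\Big(\sum_w h(w) - h(x) + 1\Big)\Big)}_{\text{actual work}}
   \;+\; \underbrace{(\log n)\, h(x)}_{\text{potential created}}
   \;-\; \underbrace{(\log n) \sum_w h(w)}_{\text{potential released}}
   \;=\; O(\log n).
\end{align*}
```

The sum-of-heights terms cancel against the released potential, leaving O(log n) per node.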
Concretely, a merge iterates over each node y in T_w, searches for y's predecessor z by birth date in T_x, and inserts y into T_x if y is more generations removed from x than z; then, if y was inserted into T_x, we find the node in T_x of lowest level strictly exceeding the level of y and splice out the intermediate nodes, maintaining the invariant that T_x is strictly ordered by both birth date and level. This costs O(log n) per node of T_w, and T_w contains at most O(h(w)) nodes (one per level), so the total cost of merging all the trees is O((log n) · sum_w h(w)), summing over all children w except the child w' for which h(w') is maximal.
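Here is a hedged sketch of that merge step, again with plain sorted lists standing in for the balanced BSTs (so the O(log n)-per-operation claim does not hold for this toy version, only the logic). The name `merge_chain` and the (birth, level) tuple encoding are assumptions of the sketch.

```python
import bisect

def merge_chain(base, other):
    """Fold the chain `other` into the chain `base`, preserving the
    invariant that `base` strictly increases in both birth date and
    level.  Both inputs are sorted lists of (birth, level) tuples."""
    for birth, level in other:
        i = bisect.bisect_left(base, (birth, level))
        # insert y only if its level strictly exceeds its predecessor's
        if i == 0 or base[i - 1][1] < level:
            # splice out following entries whose level y now dominates
            j = i
            while j < len(base) and base[j][1] <= level:
                j += 1
            base[i:j] = [(birth, level)]
    return base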
We store the level associated with each element of T_x in an auxiliary field of each node in the tree. We need this value so that we can determine the actual level of x once T_x is built. (As a technical detail, we actually store each node's level as a difference from the level of its parent in T_x, so that we can quickly add a value to the levels of all nodes in a subtree. This is a standard BST trick.)
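The difference trick can be illustrated outside of any particular BST: each node stores its level as a delta from its parent, so adding a constant to every level in a subtree is an O(1) change to one delta. The class and function names here are illustrative, not part of the algorithm above.

```python
class Node:
    """Tree node storing its level as a delta from its parent's level."""

    def __init__(self, key, level, parent=None):
        self.key = key
        self.parent = parent
        # relative encoding: absolute level = sum of deltas up to the root
        self.delta = level - (parent.level() if parent else 0)

    def level(self):
        # recover the absolute level by summing deltas along the root path
        node, total = self, 0
        while node is not None:
            total += node.delta
            node = node.parent
        return total

def add_to_subtree(root, amount):
    """Add `amount` to the level of every node under `root` in O(1)."""
    root.delta += amount
```

In a real balanced BST the same idea applies, with the delta adjusted during rotations.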
That's it! Just note that the initial potential is 0 and the final potential is nonnegative, so the sum of the amortized bounds is an upper bound on the total cost of all merges across the whole graph. We find the label of each node x as soon as we have created the BST T_x, by binary searching for the last element of T_x born before x died, at a cost of O(log n).
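That final lookup is a plain predecessor search by birth date. A minimal sketch, assuming the same sorted-list stand-in for T_x and an illustrative function name:

```python
import bisect

def last_born_before(chain, death_date):
    """Return the last (birth, level) entry of the chain whose birth date
    is strictly before `death_date`, or None if there is none.
    `chain` is sorted by birth date, as the T_x invariant guarantees."""
    i = bisect.bisect_left(chain, (death_date,))
    return chain[i - 1] if i > 0 else None
```

With a balanced BST in place of the list, the same search is the promised O(log n).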
To improve the bound to O(n log(maximum level label)), you can merge the trees lazily, combining only the first few elements of each tree as needed to produce the answer for the current node. If you use a BST that exploits locality of reference, such as a splay tree, you should be able to achieve the bound above.
Hopefully the above algorithm and analysis are at least clear enough to follow. Just comment if you need any clarification.