How can I generate all possible Newick tree permutations for a set of species given an outgroup?

How do I generate all possible Newick tree permutations for a set of species with a defined outgroup?

For those who do not know what the Newick tree format is, a good description is available at: https://en.wikipedia.org/wiki/Newick_format

I want to generate all possible Newick tree permutations for a set of species with a defined outgroup. The number of leaf nodes I expect to work with is most likely 4, 5, or 6.

Both "soft" and "hard" polytomies are allowed. https://en.wikipedia.org/wiki/Polytomy#Soft_polytomies_vs._hard_polytomies https://biology.stackexchange.com/questions/23667/evidence-discussions-of-hard-polytomy

Below is the desired output, with "E" set as the outgroup.

Desired output:

((("A","B","C"),("D"),("E")); ((("A","B","D"),("C"),("E")); ((("A","C","D"),("B"),("E")); ((("B","C","D"),("A"),("E")); ((("A","B")("C","D"),("E")); ((("A","C")("B","D"),("E")); ((("B","C")("A","D"),("E")); (("A","B","C","D"),("E")); (((("A","B"),"C"),"D"),("E")); 

However, every solution I tried based on itertools, in particular itertools.permutations, ran into the problem of equivalent outputs. My latest attempt produced equivalent outputs like the ones shown below.

Equivalent output:

 ((("C","B","A"),("D"),("E")); ((("C","A","B"),("D"),("E")); ((("A","C","B"),("D"),("E")); 

Here is the beginning of my idea for a solution. However, apart from itertools, I am not sure how else to tackle this problem.

import itertools

def Newick_Permutation_Generator(list_of_species, name_of_outgroup):
    permutations_list = list(itertools.permutations(list_of_species))
    for given_permutation in permutations_list:
        process(given_permutation)  # placeholder: format the permutation as a Newick string

Newick_Permutation_Generator(["A","B","C","D","E"], "E")
2 answers

A tree as a recursive set of leaves

Let's set aside the newick representation for now and think about a possible python representation of the problem.

A rooted tree can be thought of as a recursive hierarchy of sets of (sets of (sets of ...)) leaves. Sets are unordered, which is quite suitable for describing clades in a tree: {{{"A", "B"}, {"C", "D"}}, "E"} should be the same as {{{"C", "D"}, {"B", "A"}}, "E"}.

If we consider the initial set of leaves {"A", "B", "C", "D", "E"}, the trees that have "E" as outgroup are the sets of the form {tree, "E"}, where tree is taken from the set of trees that can be built from the set of leaves {"A", "B", "C", "D"}. We could try to write a recursive trees function to generate this set of trees, and our total set of trees would then be expressed as follows:

 {{tree, "E"} for tree in trees({"A", "B", "C", "D"})} 

(Here I use a set comprehension.)

Actually, python does not allow sets of sets, because the elements of a set have to be "hashable" (that is, python must be able to compute some "hash" value of the objects in order to check whether they belong to the set or not). It happens that python set objects do not have this property. Fortunately, we can use a similar data structure named frozenset, which behaves very much like a set but cannot be modified and is "hashable". Our total set of trees would therefore be:

all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})
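
Just to illustrate the difference (this snippet is not part of the final code), a plain set of sets raises a TypeError, while nested frozensets compare as expected regardless of the order in which their elements are written:

# A plain set of sets fails because set objects are not hashable:
#     {{"A", "B"}, "E"}  ->  TypeError: unhashable type: 'set'
# Nested frozensets work, and compare as unordered collections:
t1 = frozenset({frozenset({frozenset({"A", "B"}), frozenset({"C", "D"})}), "E"})
t2 = frozenset({frozenset({frozenset({"C", "D"}), frozenset({"B", "A"})}), "E"})
print(t1 == t2)  # True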

Implementing the trees function

Let's now focus on the trees function.

For each possible partition (a decomposition into disjoint subsets that together include all the elements) of the set of leaves, we need to find all the possible trees (through a recursive call) for each part of the partition. For a given partition, we will then make a tree for every possible combination of subtrees taken across its parts.

For instance, if the partition is {"A", {"B", "C", "D"}}, we will consider all the possible trees that can be made from the part "A" (actually, just the leaf "A") and all the possible trees that can be made from the part {"B", "C", "D"} (that is, trees({"B", "C", "D"})). The possible trees for this partition are then obtained by taking all the possible pairs where one element comes from just "A" and the other comes from trees({"B", "C", "D"}).

This can be generalized to partitions with more than two parts, and the product function from itertools seems useful here.
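
To make this concrete, here is a small illustration of the combination step, using hand-written subtree lists rather than the real trees function defined below:

from itertools import product

# Suppose the partition is {"A", {"B", "C", "D"}} and we already know
# some possible subtrees for each part (hand-written here for brevity).
subtrees_of_A = ["A"]  # a single leaf is its own (only) tree
subtrees_of_BCD = [
    frozenset({"B", "C", "D"}),               # the unresolved (polytomy) tree
    frozenset({frozenset({"B", "C"}), "D"}),  # one of the resolved trees
]
# product() pairs every subtree choice from the first part
# with every subtree choice from the second part.
for combination in product(subtrees_of_A, subtrees_of_BCD):
    print(frozenset(combination))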

Therefore, we need a way to generate the possible partitions of a set of leaves.

Generating the partitions of a set

Here is a partitions_of_set function, adapted from this solution:

# According to /questions/291868/set-partitions-in-python/1442126#1442126:
# The problem is solved recursively:
# If you already have a partition of n-1 elements, how do you use it to partition n elements?
# Either place the n'th element in one of the existing subsets, or add it as a new, singleton subset.
def partitions_of_set(s):
    if len(s) == 1:
        yield frozenset(s)
        return
    # Extract one element from the set
    # https://stackoverflow.com/a/43804050/1878788
    elem, *_ = s
    rest = frozenset(s - {elem})
    for partition in partitions_of_set(rest):
        for subset in partition:
            # Insert the element in the subset
            try:
                augmented_subset = frozenset(subset | frozenset({elem}))
            except TypeError:
                # subset is actually an atomic element
                augmented_subset = frozenset({subset} | frozenset({elem}))
            yield frozenset({augmented_subset}) | (partition - {subset})
        # Case with the element in its own extra subset
        yield frozenset({elem}) | partition

To check the resulting partitions, let's make a function that makes them easier to display (it will also be useful later to produce the newick representation of the trees):

def print_set(f):
    if type(f) not in (set, frozenset):
        return str(f)
    return "(" + ",".join(sorted(map(print_set, f))) + ")"

Let's check that the partitioning works:

for partition in partitions_of_set({"A", "B", "C", "D"}):
    print(len(partition), print_set(partition))

Output:

1 ((A,B,C,D))
2 ((A,B,D),C)
2 ((A,C),(B,D))
2 ((B,C,D),A)
3 ((B,D),A,C)
2 ((A,B,C),D)
2 ((A,B),(C,D))
3 ((A,B),C,D)
2 ((A,D),(B,C))
2 ((A,C,D),B)
3 ((A,D),B,C)
3 ((A,C),B,D)
3 ((B,C),A,D)
3 ((C,D),A,B)
4 (A,B,C,D)
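
As a side check, the number of partitions of an n-element set is the Bell number (1, 2, 5, 15, 52 for n = 1 to 5), so the 15 partitions above are what we should expect for 4 leaves. A minimal sketch of such a check:

# The counts printed below should be 1, 2, 5, 15, 52 (the Bell numbers).
for n in range(1, 6):
    print(n, len(list(partitions_of_set(set(range(n))))))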

The actual trees function

Now we can write the trees function:

from itertools import product

def trees(leaves):
    if type(leaves) not in (set, frozenset):
        # It actually is a single leaf
        yield leaves
        # Don't try to yield any more trees
        return
    # Otherwise, we will have to consider all the possible
    # partitions of the set of leaves, and for each partition,
    # construct the possible trees for each part
    for partition in partitions_of_set(leaves):
        # We need to skip the case where the partition
        # has only one subset (the initial set itself),
        # otherwise we will try to build an infinite
        # succession of nodes with just one subtree
        if len(partition) == 1:
            part, *_ = partition
            # Just to be sure the assumption is correct
            assert part == leaves
            continue
        # We recursively apply *trees* to each part
        # and obtain the possible trees by making
        # the product of the sets of possible subtrees.
        for subtree in product(*map(trees, partition)):
            # Using a frozenset guarantees
            # that there will be no duplicates
            yield frozenset(subtree)

Testing:

all_trees = frozenset(
    {frozenset({tree, "E"}) for tree in trees({"A", "B", "C", "D"})})

for tree in all_trees:
    print(print_set(tree) + ";")

Output:

(((B,C),A,D),E);
((((A,B),D),C),E);
((((B,D),A),C),E);
((((C,D),A),B),E);
(((A,D),B,C),E);
((A,B,C,D),E);
((((B,D),C),A),E);
(((A,B,C),D),E);
((((A,C),B),D),E);
((((C,D),B),A),E);
((((B,C),A),D),E);
(((A,B),C,D),E);
(((A,C),(B,D)),E);
(((B,D),A,C),E);
(((C,D),A,B),E);
((((A,B),C),D),E);
((((A,C),D),B),E);
(((A,C,D),B),E);
(((A,D),(B,C)),E);
((((A,D),C),B),E);
((((B,C),D),A),E);
(((A,B),(C,D)),E);
(((A,B,D),C),E);
((((A,D),B),C),E);
(((A,C),B,D),E);
(((B,C,D),A),E);

I hope the result is correct.

This approach was a bit tricky to get right. It took me a while to figure out how to avoid infinite recursion (which happens when the partition is {{"A", "B", "C", "D"}}).
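
If you want an extra sanity check, the number of distinct trees generated this way can be compared with the known counts of multifurcating rooted trees on labeled leaves (1, 4, 26 and 236 for 2, 3, 4 and 5 leaves); the 26 trees printed above are consistent with this. A minimal sketch:

# Expected counts: 1, 4, 26, 236 (multifurcating rooted trees on labeled leaves).
for leaves in ({"A", "B"},
               {"A", "B", "C"},
               {"A", "B", "C", "D"},
               {"A", "B", "C", "D", "E"}):
    print(len(leaves), len(frozenset(trees(leaves))))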


This was a difficult question! Here is the route I took.

The first observation is that the outgroup is always a single node tacked onto the end of the newick string. Let's call the remaining species the ingroup and try to generate all their permutations. Then we simply add the outgroup.

from itertools import permutations

def ingroup_generator(species, n):
    for perm in permutations(species, n):
        yield tuple([tuple(perm), tuple(s for s in species if s not in perm)])

def format_newick(s, outgroup=''):
    return '(' + ', '.join('({})'.format(', '.join(p)) for p in s) + ',({}));'.format(outgroup)

species = ["A","B","C","D","E"]
outgroup = "E"
ingroup = [s for s in species if s != outgroup]

itertools_newicks = []
for n in range(1, len(ingroup)):
    for p in ingroup_generator(ingroup, n):
        itertools_newicks.append(format_newick(p, outgroup))

for newick in itertools_newicks:
    print newick

This returns 40 newicks:

((A), (B, C, D),(E));
((B), (A, C, D),(E));
((C), (A, B, D),(E));
((D), (A, B, C),(E));
((A, B), (C, D),(E));
((A, C), (B, D),(E));
((A, D), (B, C),(E));
((B, A), (C, D),(E));
((B, C), (A, D),(E));
((B, D), (A, C),(E));
((C, A), (B, D),(E));
((C, B), (A, D),(E));
((C, D), (A, B),(E));
((D, A), (B, C),(E));
((D, B), (A, C),(E));
((D, C), (A, B),(E));
((A, B, C), (D),(E));
((A, B, D), (C),(E));
((A, C, B), (D),(E));
((A, C, D), (B),(E));
((A, D, B), (C),(E));
((A, D, C), (B),(E));
((B, A, C), (D),(E));
((B, A, D), (C),(E));
((B, C, A), (D),(E));
((B, C, D), (A),(E));
((B, D, A), (C),(E));
((B, D, C), (A),(E));
((C, A, B), (D),(E));
((C, A, D), (B),(E));
((C, B, A), (D),(E));
((C, B, D), (A),(E));
((C, D, A), (B),(E));
((C, D, B), (A),(E));
((D, A, B), (C),(E));
((D, A, C), (B),(E));
((D, B, A), (C),(E));
((D, B, C), (A),(E));
((D, C, A), (B),(E));
((D, C, B), (A),(E));

Some of these are duplicates, but we will remove the duplicates later.

As bli noted in the comments, (((("A","B"),"C"),"D"),("E")); and its variants should also be considered valid solutions. The comments on BioStars pointed me in the right direction: this is the same as generating all the possible groupings of a binary tree. I found a nice solution at fooobar.com/questions/1145816/...:

# A very simple representation for Nodes. Leaves are anything which is not a Node.
class Node(object):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __repr__(self):
        return '(%s, %s)' % (self.left, self.right)

# Given a tree and a label, yields every possible augmentation of the tree by
# adding a new node with the label as a child "above" some existing Node or Leaf.
def add_leaf(tree, label):
    yield Node(label, tree)
    if isinstance(tree, Node):
        for left in add_leaf(tree.left, label):
            yield Node(left, tree.right)
        for right in add_leaf(tree.right, label):
            yield Node(tree.left, right)

# Given a list of labels, yield each rooted, unordered full binary tree with
# the specified labels.
def enum_unordered(labels):
    if len(labels) == 1:
        yield labels[0]
    else:
        for tree in enum_unordered(labels[1:]):
            for new_tree in add_leaf(tree, labels[0]):
                yield new_tree

Then

enum_newicks = []
for t in enum_unordered(ingroup):
    enum_newicks.append('({},({}));'.format(t, outgroup))

for newick in enum_newicks:
    print newick

yields the following 15 newicks:

((A, (B, (C, D))),(E));
(((A, B), (C, D)),(E));
((B, (A, (C, D))),(E));
((B, ((A, C), D)),(E));
((B, (C, (A, D))),(E));
((A, ((B, C), D)),(E));
(((A, (B, C)), D),(E));
((((A, B), C), D),(E));
(((B, (A, C)), D),(E));
(((B, C), (A, D)),(E));
((A, (C, (B, D))),(E));
(((A, C), (B, D)),(E));
((C, (A, (B, D))),(E));
((C, ((A, B), D)),(E));
((C, (B, (A, D))),(E));
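
As a quick check, 15 is indeed the number of rooted, unordered binary trees on 4 labeled leaves, given by the double factorial (2n - 3)!!. A minimal sketch of the verification, reusing the ingroup and enum_newicks variables defined above:

# (2n - 3)!! = 3 * 5 * ... for n leaves; this is 15 for n = 4.
n = len(ingroup)
expected = 1
for k in range(3, 2 * n - 2, 2):
    expected *= k
assert len(enum_newicks) == expected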

So now we already have 40 + 15 = 55 possible newicks, and we have to remove the duplicates.

The first dead end I tried was to make a canonical representation of each newick so that I could use these as keys in a dictionary. The idea was to recursively sort the strings in all the nodes. But first I had to capture all the (nested) nodes. I could not use regular expressions, because nested structures are by definition not regular.

So, I used the pyparsing package and came up with the following:

from pyparsing import nestedExpr

def sort_newick(t):
    if isinstance(t, str):
        return sorted(t)
    else:
        if all(isinstance(c, str) for c in t):
            return sorted(t)
        if all(isinstance(l, list) for l in t):
            return [sort_newick(l) for l in sorted(t, key=lambda k: sorted(k))]
        else:
            return [sort_newick(l) for l in t]

def canonical_newick(n):
    n = n.replace(',', '')
    p = nestedExpr().parseString(n).asList()
    s = sort_newick(p)
    return str(s)

This gave, for

from collections import defaultdict

all_newicks = itertools_newicks + enum_newicks

d = defaultdict(list)
for newick in all_newicks:
    d[canonical_newick(newick)].append(newick)

for canonical, newicks in d.items():
    print canonical
    for newick in newicks:
        print '\t', newick
    print

a dictionary with 22 keys:

[[[['A'], [['C'], ['B', 'D']]], ['E']]]
    ((A, (C, (B, D))),(E));

[[[['B'], [['A'], ['C', 'D']]], ['E']]]
    ((B, (A, (C, D))),(E));

[[[['B'], [['A', 'C'], ['D']]], ['E']]]
    ((B, ((A, C), D)),(E));

[[['A', 'C', 'D'], ['B'], ['E']]]
    ((B), (A, C, D),(E));
    ((A, C, D), (B),(E));
    ((A, D, C), (B),(E));
    ((C, A, D), (B),(E));
    ((C, D, A), (B),(E));
    ((D, A, C), (B),(E));
    ((D, C, A), (B),(E));

[[['A', 'B'], ['C', 'D'], ['E']]]
    ((A, B), (C, D),(E));
    ((B, A), (C, D),(E));
    ((C, D), (A, B),(E));
    ((D, C), (A, B),(E));

[[[[['A'], ['B', 'C']], ['D']], ['E']]]
    (((A, (B, C)), D),(E));

[[[['A', 'C'], ['B', 'D']], ['E']]]
    (((A, C), (B, D)),(E));

[[['A'], ['B', 'C', 'D'], ['E']]]
    ((A), (B, C, D),(E));
    ((B, C, D), (A),(E));
    ((B, D, C), (A),(E));
    ((C, B, D), (A),(E));
    ((C, D, B), (A),(E));
    ((D, B, C), (A),(E));
    ((D, C, B), (A),(E));

[[[['A', 'D'], ['B', 'C']], ['E']]]
    (((B, C), (A, D)),(E));

[[['A', 'B', 'C'], ['D'], ['E']]]
    ((D), (A, B, C),(E));
    ((A, B, C), (D),(E));
    ((A, C, B), (D),(E));
    ((B, A, C), (D),(E));
    ((B, C, A), (D),(E));
    ((C, A, B), (D),(E));
    ((C, B, A), (D),(E));

[[['A', 'C'], ['B', 'D'], ['E']]]
    ((A, C), (B, D),(E));
    ((B, D), (A, C),(E));
    ((C, A), (B, D),(E));
    ((D, B), (A, C),(E));

[[['A', 'B', 'D'], ['C'], ['E']]]
    ((C), (A, B, D),(E));
    ((A, B, D), (C),(E));
    ((A, D, B), (C),(E));
    ((B, A, D), (C),(E));
    ((B, D, A), (C),(E));
    ((D, A, B), (C),(E));
    ((D, B, A), (C),(E));

[[[['A'], [['B'], ['C', 'D']]], ['E']]]
    ((A, (B, (C, D))),(E));

[[[[['A', 'B'], ['C']], ['D']], ['E']]]
    ((((A, B), C), D),(E));

[[[[['B'], ['A', 'C']], ['D']], ['E']]]
    (((B, (A, C)), D),(E));

[[[['C'], [['B'], ['A', 'D']]], ['E']]]
    ((C, (B, (A, D))),(E));

[[[['C'], [['A', 'B'], ['D']]], ['E']]]
    ((C, ((A, B), D)),(E));

[[[['A'], [['B', 'C'], ['D']]], ['E']]]
    ((A, ((B, C), D)),(E));

[[[['A', 'B'], ['C', 'D']], ['E']]]
    (((A, B), (C, D)),(E));

[[[['B'], [['C'], ['A', 'D']]], ['E']]]
    ((B, (C, (A, D))),(E));

[[[['C'], [['A'], ['B', 'D']]], ['E']]]
    ((C, (A, (B, D))),(E));

[[['A', 'D'], ['B', 'C'], ['E']]]
    ((A, D), (B, C),(E));
    ((B, C), (A, D),(E));
    ((C, B), (A, D),(E));
    ((D, A), (B, C),(E));

But closer inspection revealed some problems. Take, for example, the newicks (((A, B), (C, D)),(E)); and ((D, C), (A, B),(E));. In our dictionary d they have different canonical keys, respectively [[[['A', 'B'], ['C', 'D']], ['E']]] and [[['A', 'B'], ['C', 'D'], ['E']]]. But in fact these are duplicate trees. We can confirm this by looking at the Robinson-Foulds distance between the two trees: if it is zero, the trees are identical.

We use the robinson_foulds function from the ete3 toolkit:

from ete3 import Tree

tree1 = Tree('(((A, B), (C, D)),(E));')
tree2 = Tree('((D, C), (A, B),(E));')

rf, max_parts, common_attrs, edges1, edges2, discard_t1, discard_t2 = tree1.robinson_foulds(tree2, unrooted_trees=True)

print rf  # returns 0

OK, so the Robinson-Foulds distance is a better way to compare trees than my canonical representation. Let's wrap all the newicks in a custom MyTree object, where equality is defined as having a Robinson-Foulds distance of zero:

class MyTree(Tree):

    def __init__(self, *args, **kwargs):
        super(MyTree, self).__init__(*args, **kwargs)

    def __eq__(self, other):
        rf = self.robinson_foulds(other, unrooted_trees=True)
        return not bool(rf[0])

trees = [MyTree(newick) for newick in all_newicks]

It would be ideal if we could also define a __hash__() function that returns the same value for duplicate trees; then set(trees) would automatically remove all the duplicates.

Unfortunately, I could not find a good way to define __hash__(), but with __eq__ in place I could use index():

unique_trees = [trees[i] for i in range(len(trees)) if i == trees.index(trees[i])]
unique_newicks = [tree.write(format=9) for tree in unique_trees]

for unique_newick in unique_newicks:
    print unique_newick
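
For what it's worth, a minimal sketch of a __hash__() that is at least consistent with this __eq__() could hash only the sorted leaf names (the subclass below, assuming ete3's get_leaf_names()). Equal trees always share the same leaves, so they get the same hash, and set() then falls back on __eq__() to resolve the many collisions, which makes deduplication correct but no faster than the index() approach:

class HashableTree(MyTree):
    def __hash__(self):
        # Weak hash: all trees on the same leaf set collide on purpose,
        # so set() ends up comparing them pairwise with __eq__().
        return hash(tuple(sorted(self.get_leaf_names())))

# Hypothetical usage:
# unique_newicks = [t.write(format=9) for t in set(HashableTree(n) for n in all_newicks)]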

So here we are, at the end of our journey. I cannot give full proof that this is the correct solution, but I am quite confident that the following 19 newicks are all the possible distinct permutations:

((A),(B,C,D),(E));
((B),(A,C,D),(E));
((C),(A,B,D),(E));
((D),(A,B,C),(E));
((A,B),(C,D),(E));
((A,C),(B,D),(E));
((A,D),(B,C),(E));
((A,(B,(C,D))),(E));
((B,(A,(C,D))),(E));
((B,((A,C),D)),(E));
((B,(C,(A,D))),(E));
((A,((B,C),D)),(E));
(((A,(B,C)),D),(E));
((((A,B),C),D),(E));
(((B,(A,C)),D),(E));
((A,(C,(B,D))),(E));
((C,(A,(B,D))),(E));
((C,((A,B),D)),(E));
((C,(B,(A,D))),(E));

If we compare each newick pairwise with all the other newicks, we get confirmation that no duplicates are left in this list:

from itertools import product

for n1, n2 in product(unique_newicks, repeat=2):
    if n1 != n2:
        mt1 = MyTree(n1)
        mt2 = MyTree(n2)
        assert mt1 != mt2
