How does bootstrapping improve the quality of phylogenetic reconstruction?

Hi guys: My understanding of bootstrapping is that you

1) Build a "tree" using some algorithm from a matrix of sequences (nucleotides, say).
2) Store that tree.
3) Perturb the matrix from 1 and rebuild the tree.

My question is: what is the purpose of step 3 from a bioinformatics perspective? I can try to "guess" that, by changing the characters in the original matrix, you remove artifacts in the data, but I have a problem with this guess: I'm not sure why removing such artifacts is necessary. A sequence alignment is SUPPOSED to deal with artifacts by finding long stretches of similarity, by its very nature...

2 answers

Bootstrapping, in phylogenetics as elsewhere, does not improve the quality of whatever you are trying to estimate (a tree in this case). What it does is give you an idea of how confident you can be in the result you get from your original dataset. A bootstrap analysis answers the question "If I repeated this experiment many times, using a different sample each time (but of the same size), how often would I expect to get the same result?" This is usually broken down by edge ("How often would I expect to see this particular edge in the inferred tree?").
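To make the per-edge question concrete, here is a minimal sketch of how support could be tallied once you already have the replicate trees. It assumes each tree is represented as a set of splits (each split being the `frozenset` of taxa on one side of an edge); the function name `edge_support` and the toy data are made up for illustration, and the tree inference itself is left out.

```python
def edge_support(reference_splits, replicate_trees):
    """Fraction of replicate trees that contain each edge of the reference tree.

    Each tree is a set of splits; a split is the frozenset of taxa on one
    side of an internal edge (hypothetical representation for this sketch).
    """
    n = len(replicate_trees)
    return {
        split: sum(split in tree for tree in replicate_trees) / n
        for split in reference_splits
    }

# Toy data (invented): the original tree on taxa A-E has two internal edges.
reference = {frozenset("AB"), frozenset("ABC")}

# Four hypothetical bootstrap replicate trees.
replicates = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB"), frozenset("ABC")},
]

support = edge_support(reference, replicates)
for split, value in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(split)), value)
```

In this toy case the AB edge appears in every replicate and the ABC edge in half of them, so you would trust the first edge much more than the second.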

Sampling error

More precisely, bootstrapping is a way of approximately measuring the expected level of sampling error in your estimate. Most evolutionary models have the property that, if your dataset had an infinite number of sites, you would be guaranteed to recover the correct tree and the correct branch lengths*. But with a finite number of sites, this guarantee disappears. What you infer in these circumstances can be thought of as the correct tree plus sampling error, where the sampling error tends to decrease as the sample size (number of sites) increases. We want to know how much sampling error to expect for each edge, given that we have (say) 1000 sites.

What we would like to do but cannot

Suppose you used an alignment of 1000 sites to infer the original tree. If you somehow had the ability to sequence as many sites as you wanted for all of your taxa, you could extract another 1000 sites from each and run the tree inference again, in which case you would probably get a tree that was similar to, but slightly different from, the original tree. You could do this again and again, each time using a fresh batch of 1000 sites; if you did this many times, you would end up with a distribution of trees. This is called the sampling distribution of the estimate. In general, it will have its highest density near the true tree. It also becomes more concentrated around the true tree as you increase the sample size (number of sites).
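In a simulation we can do exactly this, because we control the generating process. The sketch below is only a simplified stand-in: instead of inferring a tree, it estimates a single quantity (the proportion of sites at which two sequences differ) from fresh batches of sites, and measures how much the estimates scatter. The true difference probability `p_diff`, the function name, and the batch sizes are all assumptions chosen for illustration.

```python
import random
import statistics

def estimate_spread(n_sites, p_diff=0.1, n_repeats=300, seed=0):
    """Standard deviation of the estimate across many fresh samples.

    Each repeat draws n_sites new sites; each site differs between the two
    sequences with probability p_diff (an assumed, simulated 'truth').
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        diffs = sum(rng.random() < p_diff for _ in range(n_sites))
        estimates.append(diffs / n_sites)
    return statistics.stdev(estimates)

spread_100 = estimate_spread(100)
spread_1000 = estimate_spread(1000)
print(spread_100, spread_1000)
```

The spread for 1000 sites comes out clearly smaller than for 100 sites, which is the same narrowing of the sampling distribution described above, just for a much simpler estimate than a tree.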

What does this distribution tell us? It tells us how likely it is that any given sample of 1000 sites generated by this evolutionary process (tree + branch lengths + other parameters) will actually give us the true tree; in other words, how confident we can be in our original analysis. As I mentioned above, this probability of getting the right answer can be broken down by edge, and that is what “bootstrap probabilities” are.

What we can do instead

In reality, we don't have the ability to magically generate as many alignment columns as we want, but we can “pretend” that we do, simply by treating the original set of 1000 sites as a pool from which we draw a new batch of 1000 sites with replacement for each replicate. This generally yields a distribution of results that differs from the true sampling distribution for 1000 sites, but for large numbers of sites it is a good approximation.
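The resampling step itself is short. Here is a minimal sketch, assuming the alignment is held as a dict mapping taxon names to equal-length aligned sequences; the function name `bootstrap_alignment` and the tiny example alignment are invented for illustration, and each replicate would then be passed to whatever tree-building method produced the original tree.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    Columns (sites) are drawn with replacement from the original alignment,
    and the replicate has the same number of columns as the original.
    """
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        taxon: "".join(seq[i] for i in cols)
        for taxon, seq in alignment.items()
    }

# Toy alignment (invented data): 3 taxa, 10 sites.
alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTTCGTAC",
    "taxon_C": "ACGAACGTTC",
}

rng = random.Random(42)  # fixed seed so the replicate is reproducible
replicate = bootstrap_alignment(alignment, rng)
for taxon, seq in replicate.items():
    print(taxon, seq)
```

Note that whole columns are resampled, so the same column index is used for every taxon; resampling characters independently per sequence would destroy the alignment.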


* This assumes that the dataset was actually generated according to that model, which is something we cannot know for certain unless we are doing a simulation. Also, some models, such as uncorrected parsimony, have the paradoxical property that, under certain conditions, the more sites you have, the less likely you are to recover the correct tree!


Bootstrapping is a general statistical technique with applications well beyond bioinformatics. It is a flexible way to deal with small samples, or with samples from a complex population (which, I think, is the case in your application).


Source: https://habr.com/ru/post/899088/

