I have a set S of strings generated by DNA sequencing with a specific adapter fragment. This means that every string in S has a suffix that approximately matches (because of sequencing errors) a prefix of the adapter sequence. Given only the set S, how can I deduce the most probable adapter sequence used to generate S?
The set S is very large: approximately 1 million fragments, each 50 characters long. I know that building a generalized suffix tree over S should help a lot with this problem, but I'm not sure how to use it to find the most likely adapter sequence.
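To make the model concrete, this is roughly how I would score a candidate adapter against S if I already had one. It is only a minimal sketch: it assumes substitution errors only (no insertions or deletions), and `best_overlap`, `support`, `min_len`, and `max_err_rate` are names and thresholds I made up for illustration. What I am missing is the inverse step, finding the candidate that maximizes this kind of score.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_overlap(read: str, adapter: str,
                 min_len: int = 4, max_err_rate: float = 0.2) -> int:
    """Length of the longest read suffix that approximately matches an
    adapter prefix, or 0 if no overlap of at least min_len qualifies.

    Assumption: errors are substitutions only, so Hamming distance is enough.
    min_len and max_err_rate are illustrative thresholds, not tuned values.
    """
    longest = min(len(read), len(adapter))
    # Try the longest possible overlap first and shrink until one fits.
    for k in range(longest, min_len - 1, -1):
        if hamming(read[-k:], adapter[:k]) <= int(max_err_rate * k):
            return k
    return 0


def support(reads, adapter, **kwargs) -> float:
    """Fraction of reads whose suffix approximately matches a prefix
    of the candidate adapter."""
    hits = sum(1 for r in reads if best_overlap(r, adapter, **kwargs) > 0)
    return hits / len(reads)


if __name__ == "__main__":
    # Toy reads and a made-up candidate adapter, for illustration only.
    candidate = "AGATCGGAAG"
    reads = ["TTTTTTAGATCG", "TTAGATCGCAAG", "CCCCCCCCCCCC"]
    for r in reads:
        print(r, best_overlap(r, candidate))
    print("support:", support(reads, candidate))
```

With a short `min_len`, some reads will match a candidate purely by chance, so the score is only meaningful when compared across candidates.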