Sequence smoothing

Question

I think there must be an existing algorithm for this, probably in a field like bioinformatics (the problem reminds me a bit of sequence alignment), so I hope someone can help me here.

The problem is this: suppose I have classified some data into two different classes, X and Y. The result might look something like this: .. XXX Y XXX .. Next, suppose we have some domain knowledge about the classes and know that it is highly unlikely to have fewer than a certain number of instances in a row (i.e. it is unlikely that there will be fewer than 4 Xs or Ys in a row; ideally I could use a different threshold for each class, but this is not necessary). Therefore, if we use this domain knowledge, it is “obvious” that we would like to replace the single Y in the middle with an X.

Thus, the algorithm should take a sequence of classified instances and threshold values for the classes (or one threshold for all, if that simplifies the problem) and try to find a sequence that satisfies the property (no run of a class is shorter than the given threshold). Obviously, there can be a very large number of valid solutions (for example, in the above example we could also replace all Xs with Ys), so I think that minimizing the number of replacements would be a reasonable optimization criterion.
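The property and the optimization criterion described above can be written down as two small checks. This is only a sketch for making the problem statement concrete; the names `is_smooth`, `replacements` and `min_run` are hypothetical and not part of the original question.

```python
from itertools import groupby


def is_smooth(seq, min_run):
    """True if every run of identical labels is at least min_run[label] long."""
    return all(len(list(group)) >= min_run[label]
               for label, group in groupby(seq))


def replacements(original, smoothed):
    """Number of positions where the smoothed sequence differs from the original."""
    return sum(a != b for a, b in zip(original, smoothed))


thresholds = {"X": 4, "Y": 4}           # one threshold per class (assumed values)
print(is_smooth("XXXXYXXXX", thresholds))            # the lone Y breaks the property
print(is_smooth("XXXXXXXXX", thresholds))            # after replacing it with X
print(replacements("XXXXYXXXX", "XXXXXXXXX"))        # cost of that fix
```

Any candidate algorithm should return a sequence for which `is_smooth` holds while keeping `replacements` as small as possible.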

I don’t need a particularly efficient algorithm here, since the number of instances will be quite small (say <4k), and we will only have two classes. In addition, since this is obviously only a heuristic, I am fine with some inaccuracies if they greatly simplify the algorithm.

+6

algorithm

Voo Jun 06 '11 at 14:12

2 answers

A problem very similar to this one can be solved as a classic dynamic-programming shortest-path problem. We want to find the sequence that minimizes some notion of cost. Penalize each character in the sequence that differs from the corresponding character in the original sequence. Also penalize each change of character within the sequence, i.e. each switch from X to Y and vice versa.

This is not exactly what you want, because the penalty for YYYXYYY is the same as the penalty for YXXXXXXY: one penalty for YX and one for XY. However, it can be a good approximation because, for example, if the base sequence says YYY....YXY....YY, then it will be cheaper to change the central X to Y than to pay the cost of XY and YX. You can obviously play with the various penalties until you get something that looks plausible.

Now think of each position in the sequence as two points, one above the other: one point representing “X goes here” and one representing “Y goes here”. You can connect points with lines whose cost depends on whether the corresponding X or Y matches the character in the original sequence, and on whether the line joins an X to an X, an X to a Y, and so on. Then find the shortest path from left to right using a dynamic program that works out the best paths ending in X and in Y at position i + 1, given the best paths ending in X and in Y at position i.

If you really want to penalize short-lived changes more severely than long-lived ones, you can probably do this by increasing the number of points in the path-finding representation: you would have points corresponding to “X here and the most recent Y was 3 characters back”. But depending on what you want for a penalty, you could end up with an unreasonably large number of points for each character.
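The two-points-per-position shortest path described above could be sketched roughly as follows. The function name and the two penalty parameters (`mismatch_cost`, `switch_cost`) and their default values are assumptions chosen for illustration; the answer leaves the actual penalties open for tuning.

```python
def smooth_with_penalties(seq, mismatch_cost=1.0, switch_cost=1.5):
    """Return (cost, smoothed): cheapest path under mismatch and switch penalties."""
    if not seq:
        return 0.0, ""
    classes = "XY"
    # cost[c] = cheapest path that ends with class c at the current position
    cost = {c: (0.0 if seq[0] == c else mismatch_cost) for c in classes}
    back = []  # back[i - 1][c] = best class at position i - 1, given c at position i
    for ch in seq[1:]:
        new_cost, choice = {}, {}
        for c in classes:
            # Staying in the same class is free; switching pays switch_cost.
            def step(p, c=c):
                return cost[p] + (0.0 if p == c else switch_cost)
            prev = min(classes, key=step)
            new_cost[c] = step(prev) + (0.0 if ch == c else mismatch_cost)
            choice[c] = prev
        cost = new_cost
        back.append(choice)
    # Follow the back-pointers from the cheapest final point.
    last = min(classes, key=lambda c: cost[c])
    total = cost[last]
    out = [last]
    for choice in reversed(back):
        last = choice[last]
        out.append(last)
    return total, "".join(reversed(out))


print(smooth_with_penalties("XXXYXXX"))
print(smooth_with_penalties("XXXYXXX", switch_cost=0.4))
```

With the default penalties, flipping the lone Y (one mismatch) is cheaper than paying for two switches, so it is smoothed away. With a small `switch_cost` the two switches become cheaper than the mismatch and the Y is kept, which illustrates why the penalties need tuning.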

+2

mcdowella Jun 06 '11 at 18:04

jonderry · Accepted Answer · 2011-06-06T18:29:49+0000

You can use dynamic programming, as in the following pseudo-code sketch (for simplicity, this code assumes that the threshold is 3 Xs or Ys in a row, not 4):

    min_switch(s):
      n = len(s)
      optx = array(4, n, infinity) // initialize all values to infinity
      opty = array(4, n, infinity) // initialize all values to infinity
      if s[0] == 'X':
        optx[1][0] = 0
        opty[1][0] = 1
      else:
        optx[1][0] = 1
        opty[1][0] = 0
      for i in 1 .. n - 1:
        // a new run may only start after the previous run has reached length 3
        if s[i] == 'X':
          optx[1][i] = opty[3][i - 1]
          optx[2][i] = optx[1][i - 1]
          optx[3][i] = min(optx[2][i - 1], optx[3][i - 1])
          opty[1][i] = 1 + optx[3][i - 1]
          opty[2][i] = 1 + opty[1][i - 1]
          opty[3][i] = 1 + min(opty[2][i - 1], opty[3][i - 1])
        else:
          optx[1][i] = 1 + opty[3][i - 1]
          optx[2][i] = 1 + optx[1][i - 1]
          optx[3][i] = 1 + min(optx[2][i - 1], optx[3][i - 1])
          opty[1][i] = optx[3][i - 1]
          opty[2][i] = opty[1][i - 1]
          opty[3][i] = min(opty[2][i - 1], opty[3][i - 1])
      return min(optx[3][n - 1], opty[3][n - 1])

The above code essentially computes the lowest cost of producing a smoothed sequence up to the i-th character, while keeping track of the optimal value for each relevant number of consecutive Xs or Ys at the end (a run of 1, 2, or 3 or more). More formally:

optx[i][k] stores the lowest cost of converting the string s[0...k] into a smooth sequence that ends in a run of i consecutive Xs. Runs of 3 or more are counted in optx[3][k].
opty[j][k] stores the lowest cost of converting the string s[0...k] into a smooth sequence that ends in a run of j consecutive Ys. Runs of 3 or more are counted in opty[3][k].

It is straightforward to convert this into an algorithm that returns the sequence as well as the optimal cost.

Please note that some of the cases in the above code are probably not needed; it is just a direct recurrence derived from the constraints.
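As a rough illustration of how the recurrence might be extended to return the sequence too, here is a Python sketch with back-pointers. It is not the answer's own code: the per-class `min_run` thresholds, the `(class, run length)` state encoding, and the `None` return for infeasible inputs are assumptions made for this example. Like the pseudo-code, it also requires the first and last runs to meet the threshold.

```python
import math


def min_switch(s, min_run=None):
    """Return (cost, smoothed), or None when no valid sequence exists."""
    min_run = min_run or {"X": 3, "Y": 3}   # assumed default, as in the pseudo-code
    n = len(s)
    if n == 0:
        return 0, ""
    other = {"X": "Y", "Y": "X"}
    states = [(c, r) for c in "XY" for r in range(1, min_run[c] + 1)]
    # opt[i][(c, r)]: cheapest rewrite of s[0..i] that ends in a run of r copies
    # of c (r is capped at min_run[c]) with every earlier run long enough.
    opt = [dict.fromkeys(states, math.inf) for _ in range(n)]
    prev = [dict.fromkeys(states) for _ in range(n)]
    for c in "XY":
        opt[0][(c, 1)] = int(s[0] != c)
    for i in range(1, n):
        for c, r in states:
            cands = []
            if r == 1:               # start a new run: the old run must be complete
                cands.append((other[c], min_run[other[c]]))
            if r >= 2:               # extend a run that is still too short
                cands.append((c, r - 1))
            if r == min_run[c]:      # extend a run that is already long enough
                cands.append((c, r))
            best = min(cands, key=lambda st: opt[i - 1][st])
            opt[i][(c, r)] = opt[i - 1][best] + int(s[i] != c)
            prev[i][(c, r)] = best
    finals = [(c, min_run[c]) for c in "XY"]
    end = min(finals, key=lambda st: opt[n - 1][st])
    if math.isinf(opt[n - 1][end]):
        return None                  # e.g. the input is shorter than any threshold
    out, st = [], end
    for i in range(n - 1, -1, -1):   # walk the back-pointers right to left
        out.append(st[0])
        st = prev[i][st]
    return opt[n - 1][end], "".join(reversed(out))


print(min_switch("XXXYXXX"))
print(min_switch("XXYYYY", {"X": 2, "Y": 4}))
```

The table has at most `min_run["X"] + min_run["Y"]` states per position, so the running time stays linear in the length of the input for fixed thresholds.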
