Search for all pairs of sequences that differ in one position

Question

Search for all pairs of sequences that differ in one position

I need a data structure representing a set of sequences (all the same, known, length) with the following non-standard operation:

Find two sequences in a set that differ in exactly one index. (Or establish that such a pair does not exist.)

If N is the length of the sequences, and M is the number of sequences, then there is an obvious algorithm O(N*M*M) . I wonder if there is a standard way to solve this more efficiently. I am ready to apply pre-processing if necessary.

Bonus points, if instead of returning a pair, the algorithm returns all sequences that differ at the same point.

Alternatively , I am also interested in a solution in which I can effectively check whether a particular sequence in one index differs from any sequence in a set. If this helps, we can assume that in the set no two sequences have this property.

Edit: N can be considered small enough. By this, I mean that improvements like O(log(N)*M*M) are not immediately useful for my use.

+6

algorithm dynamic-programming

Philippe May 02, '13 at 15:31

source share

3 answers

Choose j sets of k indices each randomly (make sure that none of the sets overlap).

For each set of XOR elements.

You now have fingerprints for each document.

Compare sequences based on these fingerprints. j-1 fingerprints should match if the sequences really match. But the call may be wrong, and you may need to check the location by location.

Additional explanations regarding the comparison: Sort all fingerprints from all documents (or use a hash table). Thus, you do not need to compare each pair, but only pairs that have the corresponding fingerprint.

0

Elkamina May 02, '13 at 15:37

source share

Simple recursive approach:

Find all sequence sets that have the same first half through sorting or hash.
For each of these sets, repeat the entire process, just looking at the other half.
Find all sequence sets that have the same second half through sorting or hash.
For each of these sets, repeat the whole process, just looking at the first half.
When you have reached length 1, all those that do not match are what you are looking for.

Pseudo Code:

 findPairs(1, N) findPairs(set, start, end) mid = (start + end)/2 sort set according to start and mid indices if end - start == 1 last = '' for each seq: set if last != '' and seq != last DONE - PAIR FOUND last = seq else newSet = {} last = '' for each seq: set if newSet.length > 1 and seq and last don't match from start to mid indices findPairs(newSet, mid, end) newSet = {} newSet += seq last = seq

Just change the code to find all the pairs.

Complexity? I may be wrong, but:

Maximum depth log M (I believe) the worst case will be if all sequences are identical. In this case, the work performed will be O(N*M*log M*log M) , which is better than O(N*M*M) .

0

Dukeling May 02, '13 at 20:24

source share

Falk hüffner · Accepted Answer · 2013-05-03T12:02:17+0000

For each sequence and each position, in this sequence, calculate the hash of the sequence without position I and add it to the hash table. If the table already has an entry, you have found a potential pair that differs in only one position. Using rolling hashes from start to finish and combining them, you can calculate each hash at a constant time. Expected total run time O (N * M).

Search for all pairs of sequences that differ in one position

More articles: