Search for all pairs of sequences that differ in one position

I need a data structure representing a set of sequences (all the same, known, length) with the following non-standard operation:

Find two sequences in a set that differ in exactly one index. (Or establish that such a pair does not exist.)

If N is the length of the sequences, and M is the number of sequences, then there is an obvious algorithm O(N*M*M) . I wonder if there is a standard way to solve this more efficiently. I am ready to apply pre-processing if necessary.

Bonus points, if instead of returning a pair, the algorithm returns all sequences that differ at the same point.

Alternatively , I am also interested in a solution in which I can effectively check whether a particular sequence in one index differs from any sequence in a set. If this helps, we can assume that in the set no two sequences have this property.

Edit: N can be considered small enough. By this, I mean that improvements like O(log(N)*M*M) are not immediately useful for my use.

+6
source share
3 answers

For each sequence and each position, in this sequence, calculate the hash of the sequence without position I and add it to the hash table. If the table already has an entry, you have found a potential pair that differs in only one position. Using rolling hashes from start to finish and combining them, you can calculate each hash at a constant time. Expected total run time O (N * M).

+2
source

Choose j sets of k indices each randomly (make sure that none of the sets overlap).

For each set of XOR elements.

You now have fingerprints for each document.

Compare sequences based on these fingerprints. j-1 fingerprints should match if the sequences really match. But the call may be wrong, and you may need to check the location by location.

Additional explanations regarding the comparison: Sort all fingerprints from all documents (or use a hash table). Thus, you do not need to compare each pair, but only pairs that have the corresponding fingerprint.

0
source

Simple recursive approach:

  • Find all sequence sets that have the same first half through sorting or hash.

  • For each of these sets, repeat the entire process, just looking at the other half.

  • Find all sequence sets that have the same second half through sorting or hash.

  • For each of these sets, repeat the whole process, just looking at the first half.

  • When you have reached length 1, all those that do not match are what you are looking for.

Pseudo Code:

 findPairs(1, N) findPairs(set, start, end) mid = (start + end)/2 sort set according to start and mid indices if end - start == 1 last = '' for each seq: set if last != '' and seq != last DONE - PAIR FOUND last = seq else newSet = {} last = '' for each seq: set if newSet.length > 1 and seq and last don't match from start to mid indices findPairs(newSet, mid, end) newSet = {} newSet += seq last = seq 

Just change the code to find all the pairs.

Complexity? I may be wrong, but:

Maximum depth log M (I believe) the worst case will be if all sequences are identical. In this case, the work performed will be O(N*M*log M*log M) , which is better than O(N*M*M) .

0
source

Source: https://habr.com/ru/post/944134/


All Articles