Loop Search

I am looking for the most efficient way to store binary strings in a data structure (insert function), and then when I get a string, I want to check if any looping string of a given string is connected in my structure.

I thought about saving the input strings in Trie, but then, trying to determine if there was some kind of looping string of the string I received was added to Trie to execute | s | searches in Trie for all possible looping strings.

Is there a way to do this more efficiently while the complexity of the place will be similar to Trie?

Note: When I say circular lines of a line, I mean that, for example, all circular lines of 1011 : 0111, 1110, 1101, 1011

+6
source share
2 answers

Can you come up with a canonical function for looping lines based on the following:

  • Find the largest run of zeros.
  • Turn the line so that this zero is in front.
  • For each start of zeros of equal size, see if turning at the front creates a lexicographically smaller line, and if so use it.

This will lead to the canonization of everything in the equivalence class (1011, 1101, 1110, 0111) to the lexicographically smallest value: 0111.

0101010101 is a thorny instance for which this algorithm will not work well, but if your bits are distributed approximately randomly, it should work well in practice for long strings.

Then you can use a hash based on the canonical form, or use a trie that will only contain an empty string and lines starting with 0, and one trie run will answer your question.

EDIT:

if I have a string of length | s | it may take a long time to find the least lexicographical meaning. How long does it actually take?

That's why I said 010101.... is a value for which it does not work well. Let say that the line has a length n, and the longest run 1 has a length r. If the bits are randomly allocated, the length of the longest path is O (log n) according to the distribution of the longest run .

The search time for the longest run is O (n). You can implement a shift using an offset instead of a buffer copy, which should be O (1). The number of runs in the worst case is O (n / m).

Then the execution time of step 3 should be

  • Find other long runs: one O (n) goes with O (log n) average storage case, O (n) worst case
  • For each run: O (log n) average case, O (n) worst case
  • Shift and comparison lexicographically: O (log n) is the average case, since most comparisons of randomly selected lines fail early, O (n) is the worst case.

This leads to the worst case O (n²), but to the average case O (n + log² n) ≅ O (n).

+5
source

You have n lines s1..sn and a string t is given that you want to know if the cyclic permutation t is a substring of any s1..sn. And you want to keep the lines as efficient as possible. Did I understand your question correctly?

If yes, then this is a solution, but with a long lead time: for a given input t let t '= concat (t, t), check t' with each s in s1..sn to see if the longest subsequence is t 'and sm no less | t | If | si | = k, | t | = l holds in O (nkl) time. And you can store s1..sn in any required data structure. Is it good enough or are you?

0
source

Source: https://habr.com/ru/post/906487/


All Articles