Can you come up with a canonical function for looping lines based on the following:
- Find the largest run of zeros.
- Turn the line so that this zero is in front.
- For each start of zeros of equal size, see if turning at the front creates a lexicographically smaller line, and if so use it.
This will lead to the canonization of everything in the equivalence class (1011, 1101, 1110, 0111) to the lexicographically smallest value: 0111.
0101010101 is a thorny instance for which this algorithm will not work well, but if your bits are distributed approximately randomly, it should work well in practice for long strings.
Then you can use a hash based on the canonical form, or use a trie that will only contain an empty string and lines starting with 0, and one trie run will answer your question.
EDIT:
if I have a string of length | s | it may take a long time to find the least lexicographical meaning. How long does it actually take?
That's why I said 010101.... is a value for which it does not work well. Let say that the line has a length n, and the longest run 1 has a length r. If the bits are randomly allocated, the length of the longest path is O (log n) according to the distribution of the longest run .
The search time for the longest run is O (n). You can implement a shift using an offset instead of a buffer copy, which should be O (1). The number of runs in the worst case is O (n / m).
Then the execution time of step 3 should be
- Find other long runs: one O (n) goes with O (log n) average storage case, O (n) worst case
- For each run: O (log n) average case, O (n) worst case
- Shift and comparison lexicographically: O (log n) is the average case, since most comparisons of randomly selected lines fail early, O (n) is the worst case.
This leads to the worst case O (n²), but to the average case O (n + log² n) ≅ O (n).
source share