How to find substrings in the text very quickly?

I am trying to find the fastest way to find substrings in TextString. Here is the desired result:

findSubstringIndices :: Text -> Text -> [Int]
findSubstringIndices "asdfasdf" "as" == [0, 4]  -- 0-indexed
findSubstringIndices "asdasdasdasd" "asdasd" == [0, 3, 6]  -- matches can overlap

In my application, the substring is a fixed 6-letter word, but the search string is very long (even more than 3 billion letters). My current approach uses the package KMP:

import Data.Text.Lazy as T
import Data.Algorithms.KMP as KMP
findSubstringIndices a b = KMP.match (KMP.build $ T.unpack b) $ T.unpack a

But this seems like a huge loss of compactness from Text. Is there a (preferably concise) way to do this without unpacking?

I know there is a function called breakOnAllin Text, however this does not meet my requirement to allow matching matches.


Edit: at the suggestion of @ReidBarton, I implemented a version that does not need unpack, which is really faster. However, I'm not sure if this is the fastest.

findSubstringIndicesC t a b = let (l, r) = T.breakOn b a in case r of
    "" -> []
    _  -> T.length l : findSubstringIndicesC (t + T.length l + 1) (T.tail r) b

findSubstringIndices = findSubstringIndicesC 0
+4
1

Data.ByteString.Search , Boyer-Moore , DFA, , . Text . Text , , . .

+1

Source: https://habr.com/ru/post/1610337/


All Articles