I am trying to find the fastest way to find substrings in Text
String. Here is the desired result:
findSubstringIndices :: Text -> Text -> [Int]
findSubstringIndices "asdfasdf" "as" == [0, 4] -- 0-indexed
findSubstringIndices "asdasdasdasd" "asdasd" == [0, 3, 6] -- matches can overlap
In my application, the substring is a fixed 6-letter word, but the search string is very long (even more than 3 billion letters). My current approach uses the package KMP
:
import Data.Text.Lazy as T
import Data.Algorithms.KMP as KMP
findSubstringIndices a b = KMP.match (KMP.build $ T.unpack b) $ T.unpack a
But this seems like a huge loss of compactness from Text
. Is there a (preferably concise) way to do this without unpack
ing?
I know there is a function called breakOnAll
in Text
, however this does not meet my requirement to allow matching matches.
Edit: at the suggestion of @ReidBarton, I implemented a version that does not need unpack
, which is really faster. However, I'm not sure if this is the fastest.
findSubstringIndicesC t a b = let (l, r) = T.breakOn b a in case r of
"" -> []
_ -> T.length l : findSubstringIndicesC (t + T.length l + 1) (T.tail r) b
findSubstringIndices = findSubstringIndicesC 0