I looked at the existing options for regular expressions in Haskell, and I wanted to understand where the performance gap comes from when comparing the various options with each other, and especially with a simple call to grep...
I have a relatively small trace file (~110M, compared to the usual few tens of GB in most of my use cases):
    $ du radixtracefile
    113120 radixtracefile
    $ wc -l radixtracefile
    1051565 radixtracefile
- First, I used grep to find how many matches of the (arbitrary) pattern .*504.*ll there were:
    $ time grep -nE ".*504.*ll" radixtracefile | wc -l
    309

    real 0m0.211s
    user 0m0.202s
    sys 0m0.010s
- I looked at Text.Regex.TDFA (version 1.2.1) with Data.ByteString:
    import Control.Monad.Loops
    import Data.Maybe
    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO
    import Text.Regex.TDFA
    import qualified Data.ByteString as B

    main = do
      f <- B.readFile "radixtracefile"
      matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
      mapM_ (putStrLn . show . head) matches

Build and run:

    $ ghc -O2 test-TDFA.hs -XScopedTypeVariables
    [1 of 1] Compiling Main ( test-TDFA.hs, test-TDFA.o )
    Linking test-TDFA ...
    $ time ./test-TDFA | wc -l
    309

    real 0m4.463s
    user 0m4.431s
    sys 0m0.036s
- Then I looked at Data.Text.ICU.Regex (version 0.7.0.1) with Unicode support:
    import Control.Monad.Loops
    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO
    import Data.Text.ICU.Regex

    main = do
      re <- regex [] $ T.pack ".*504.*ll"
      f <- TIO.readFile "radixtracefile"
      setText re f
      whileM_ (findNext re) $ do
        a <- start re 0
        putStrLn $ "last match at :"++(show a)

Build and run:

    $ ghc -O2 test-ICU.hs
    [1 of 1] Compiling Main ( test-ICU.hs, test-ICU.o )
    Linking test-ICU ...
    $ time ./test-ICU | wc -l
    309

    real 1m36.407s
    user 1m36.090s
    sys 0m0.169s
I am using GHC version 7.6.3. I have not had the occasion to test other Haskell regex options. I knew I would not get the performance I had with grep, and I was more than happy with that, but more or less 20 times slower for TDFA with ByteString... that is very scary. And I can't really understand why, since I naively thought this was a wrapper around a native backend... Am I somehow using the module incorrectly?
(And let's not even mention the ICU + Text combo, which goes through the roof.)
Is there an option I have not tested that would make me happier?
EDIT:
- Text.Regex.PCRE (version 0.94.4) with Data.ByteString:
    import Control.Monad.Loops
    import Data.Maybe
    import Text.Regex.PCRE
    import qualified Data.ByteString as B

    main = do
      f <- B.readFile "radixtracefile"
      matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
      mapM_ (putStrLn . show . head) matches

Build and run:

    $ ghc -O2 test-PCRE.hs -XScopedTypeVariables
    [1 of 1] Compiling Main ( test-PCRE.hs, test-PCRE.o )
    Linking test-PCRE ...
    $ time ./test-PCRE | wc -l
    309

    real 0m1.442s
    user 0m1.412s
    sys 0m0.031s
Better, but still slower than grep by a factor of about 7...
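For reference, the slowdown factors quoted in this post are simply the `real` times of each run divided by the `real` time of the grep run. A minimal sketch of that arithmetic follows; the names `grepSeconds`, `timings` and `slowdown` are purely illustrative and not part of any library:

```haskell
import Text.Printf (printf)

-- Wall-clock ("real") time of the grep run, in seconds, copied from above.
grepSeconds :: Double
grepSeconds = 0.211

-- Wall-clock ("real") times of the Haskell runs, in seconds, copied from above.
timings :: [(String, Double)]
timings =
  [ ("TDFA + ByteString", 4.463)
  , ("PCRE + ByteString", 1.442)
  , ("ICU + Text", 96.407) -- 1m36.407s
  ]

-- How many times slower than grep a run with the given time was.
slowdown :: Double -> Double
slowdown t = t / grepSeconds

main :: IO ()
main = mapM_ report timings
  where
    report :: (String, Double) -> IO ()
    report (name, t) = printf "%-20s %6.1fx\n" name (slowdown t)
```

This gives roughly 21x for TDFA, 6.8x for PCRE and about 457x for ICU, all from single runs, so the figures are only indicative.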