Preliminary considerations
Comparing apples to oranges is not recommended: the PCRE regex engine can do much more than the TRE regex engine, which does not support lookarounds, backtracking control verbs, or case-changing operators in replacement patterns (the latter are actually an extension of the PCRE library used in R). Also, when you use . in TRE regex patterns, remember that it matches line breaks too, whereas in PCRE patterns, to make . match line breaks, you need to put the inline DOTALL modifier (?s) before the . that must match any character, including line breaks (or use a modifier group such as (?s:.*)).
If we want to compare the performance of the TRE and PCRE regex engines in R, we must use simple patterns that match literally the same texts with both engines.
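The differences described above can be illustrated with a few base R calls. This is only a minimal sketch; the sample string x and the variable names are made up for illustration.

```r
# Hypothetical two-line sample string (not from the benchmark text)
x <- "first line\nsecond line"

# TRE (the default engine): "." also matches the line break
tre_dot <- grepl("line.second", x)

# PCRE: "." does not match "\n" unless DOTALL is switched on
pcre_dot    <- grepl("line.second", x, perl = TRUE)
pcre_dotall <- grepl("(?s)line.second", x, perl = TRUE)   # inline modifier
pcre_group  <- grepl("line(?s:.)second", x, perl = TRUE)  # modifier group

# PCRE-only features: a lookahead, and the \U case-changing operator
# in the replacement (an R extension available with perl = TRUE)
lookahead <- sub("\\w+(?= line)", "X", x, perl = TRUE)
upper     <- sub("(\\w+)", "\\U\\1", x, perl = TRUE)

print(c(tre_dot = tre_dot, pcre_dot = pcre_dot,
        pcre_dotall = pcre_dotall, pcre_group = pcre_group))
```

Because of these differences, a pattern containing . may match different text in the two engines unless (?s) is added on the PCRE side.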
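Before timing anything, it may be worth checking that a pattern really returns the same result with both engines. A minimal sketch of such a check, using a made-up string sample_txt rather than the benchmark text:

```r
# Hypothetical sample input with two parenthesised parts
sample_txt <- "alpha (first) beta (second) gamma"
pattern <- ".*\\((.*)\\).*"

tre_result  <- sub(pattern, "\\1", sample_txt)               # TRE (default)
pcre_result <- sub(pattern, "\\1", sample_txt, perl = TRUE)  # PCRE

# Only benchmark the pattern if both engines agree on the output
same <- identical(tre_result, pcre_result)
print(same)
```

If the check fails, the timings would describe two different matching tasks rather than two engines doing the same work.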
Performance tests in Windows 7, Linux Ubuntu 16.04, MacOS Sierra 10.12.6
I mainly use R on Windows, but I installed R 3.2.3 on a Linux virtual machine specifically for this testing. MacOS results are taken from t.kalinowski's answer.
Compare the performance of the TRE (default) and PCRE (perl=TRUE) regex engines using the microbenchmark library (see more benchmarking options in R):
library(microbenchmark)
The text is taken from the Wikipedia article on butterflies.
txt <- "Butterflies are insects in the macrolepidopteran clade Rhopalocera from the order Lepidoptera, which also includes moths. Adult butterflies have large, often brightly coloured wings, and conspicuous, fluttering flight. The group comprises the large superfamily Papilionoidea, which contains at least one former group, the skippers (formerly the superfamily \"Hesperioidea\") and the most recent analyses suggest it also contains the moth-butterflies (formerly the superfamily \"Hedyloidea\"). Butterfly fossils date to the Paleocene, which was about 56 million years ago."
Let's try to extract the last text inside parentheses with sub, a very common sub operation in R:
# sub('.*\\((.*)\\).*', '\\1', txt)
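The results refer to PCRE_1 and TRE_1, whose definitions are not shown here. The sketch below assumes they wrap the sub call above in the same way that PCRE_2 and TRE_2 wrap gregexpr further down; the string demo_txt and the reduced times value are illustrative assumptions, not the original benchmark settings.

```r
# Assumed wrappers, modelled on the PCRE_2 / TRE_2 functions
PCRE_1 <- function(text) sub(".*\\((.*)\\).*", "\\1", text, perl = TRUE)
TRE_1  <- function(text) sub(".*\\((.*)\\).*", "\\1", text)

# Short stand-in for the full butterfly text (illustration only)
demo_txt <- "skippers (formerly \"Hesperioidea\") and moth-butterflies (formerly \"Hedyloidea\")"

res_pcre <- PCRE_1(demo_txt)
res_tre  <- TRE_1(demo_txt)
print(res_pcre)

# Run the timing only if the microbenchmark package is installed;
# times is kept small here so the sketch finishes quickly
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  test <- microbenchmark::microbenchmark(
    PCRE_1(demo_txt),
    TRE_1(demo_txt),
    times = 1000
  )
  print(test)
}
```

Because the leading .* is greedy, both calls return the content of the last pair of parentheses.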
The results are as follows:
WINDOWS
-------
Unit: microseconds
        expr     min      lq      mean  median      uq       max neval
 PCRE_1(txt) 163.607 165.418 168.65393 166.625 167.229  7314.588 5e+05
  TRE_1(txt)  70.031  72.446  74.53842  73.050  74.257 38026.680 5e+05

MacOS
-----
Unit: microseconds
        expr    min     lq     mean median     uq       max neval
 PCRE_1(txt) 31.693 32.857 37.00757 33.413 35.805 43810.177 5e+05
  TRE_1(txt) 46.037 47.199 53.06407 47.807 51.981  7702.869 5e+05

Linux
-----
Unit: microseconds
        expr    min     lq     mean median     uq       max neval
 PCRE_1(txt) 10.557 11.555 13.78216 12.097 12.662  4301.178 5e+05
  TRE_1(txt) 25.875 27.350 31.51925 27.805 28.737 17974.716 5e+05
The TRE version of sub wins only on Windows, where it is more than 2 times faster. On MacOS and Linux, the PCRE version (perl=TRUE) wins: by a similar factor (about 2.3 times by median) on Linux, and by roughly 1.4 times on MacOS.
Now, let's compare the performance of regexps that do not rely heavily on backtracking, and extract the words inside double quotes:
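The speed-up factors can be read off the median column of the tables above. A small sketch of that arithmetic; the data frame layout and column names are just one convenient way to arrange the reported numbers:

```r
# Median timings in microseconds, copied from the benchmark output above
medians <- data.frame(
  os   = c("Windows", "MacOS", "Linux"),
  PCRE = c(166.625, 33.413, 12.097),
  TRE  = c(73.050, 47.807, 27.805)
)

# Ratio > 1 means PCRE is faster; ratio < 1 means TRE is faster
medians$tre_over_pcre <- round(medians$TRE / medians$PCRE, 2)
medians$winner <- ifelse(medians$tre_over_pcre > 1, "PCRE", "TRE")
print(medians)
```

Medians are used rather than means because the max column shows a few very large outliers on every platform.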
# regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt))
# => [1] "\"Hesperioidea\"" "\"Hedyloidea\""
# The functions use their `text` argument rather than the global `txt`
PCRE_2 <- function(text) { return(regmatches(text, gregexpr("\"[A-Za-z]+\"", text, perl=TRUE))) }
TRE_2 <- function(text) { return(regmatches(text, gregexpr("\"[A-Za-z]+\"", text))) }
test <- microbenchmark( PCRE_2(txt), TRE_2(txt), times = 500000 )
test

WINDOWS
-------
Unit: microseconds
        expr     min      lq     mean  median      uq       max neval
 PCRE_2(txt) 324.799 330.232 349.0281 332.646 336.269 124404.14 5e+05
  TRE_2(txt) 187.755 191.981 204.7663 193.792 196.208  74554.94 5e+05

MacOS
-----
Unit: microseconds
        expr    min     lq     mean median     uq      max neval
 PCRE_2(txt) 63.801 68.115 75.51773 69.164 71.219 47686.40 5e+05
  TRE_2(txt) 63.825 67.849 75.20246 68.883 70.933 49691.92 5e+05

LINUX
-----
Unit: microseconds
        expr    min     lq     mean median     uq     max neval
 PCRE_2(txt) 30.199 34.750 44.05169 36.151 43.403 38428.2 5e+05
  TRE_2(txt) 37.752 41.854 52.58230 43.409 51.781 38915.7 5e+05
PCRE has the best mean value on Linux; on MacOS the difference is almost negligible; on Windows, TRE is much faster.
Summary
It is clear that the TRE regex library (the default) is much faster on Windows. On Linux, the PCRE regex is much faster. On MacOS, the PCRE regex is still preferable: the two engines are on par for the pattern with little backtracking, but for the backtracking pattern PCRE is faster than TRE on this OS.