Regular expressions in base R: 'perl = TRUE' by default? (PCRE vs. TRE)

When using base R string functions such as gsub and grep, is there any downside to always specifying perl = TRUE as a matter of habit?

With perl=TRUE, expressions can do more (for example, you can use lookahead or lookbehind assertions, or convert case with \\U), and performance is also better, as the documentation states.

So are there any downsides? Is perl = TRUE not the default only for backward compatibility? Are there any portability issues I should know about when using perl = TRUE?
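For illustration, here is a minimal sketch of the features mentioned above. The sample strings and variable names are made up for this example and are not taken from the R documentation.

```r
# Illustrative input; the string is invented for this sketch.
x <- "price: 100 USD, cost: 250 EUR"

# Lookahead: extract only the digits that are followed by " USD".
usd <- regmatches(x, gregexpr("[0-9]+(?= USD)", x, perl = TRUE))[[1]]

# Lookbehind: replace the digits that are preceded by "cost: ".
masked <- sub("(?<=cost: )[0-9]+", "***", x, perl = TRUE)

# Case conversion in the replacement: \\U upper-cases what follows,
# \\L lower-cases what follows.
titled <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "hello regex world", perl = TRUE)

print(usd)
print(masked)
print(titled)
```

None of these three calls gives the same result without perl = TRUE, which is why the comparison between the two engines only makes sense for patterns both of them understand.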

3 answers

Preliminary considerations

It is not advisable to compare apples with oranges: a PCRE regex can do much more than a TRE regex, which does not support lookarounds, backtracking control verbs, or case-changing operators in replacement patterns (the latter are actually an extension to the PCRE library as used in R). Also, when you use . in TRE patterns, remember that it matches line breaks too. In PCRE patterns, to make . match line breaks, you need to put the inline DOTALL modifier (?s) before the . that must match any character, including line breaks (or use a modifier group such as (?s:.*)).

If we want to compare the performance of the TRE and PCRE engines in R, we must use simple patterns that match literally the same text with both engines.
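The difference in how . treats line breaks can be seen with a short sketch. The sample string and variable names below are invented for illustration.

```r
# Illustrative two-line input; the string is invented for this sketch.
x <- "first line\nsecond line"

# TRE (the default): "." also matches the newline, so ".*" spans both lines.
tre_match <- regmatches(x, regexpr("first.*second", x))

# PCRE: "." stops at the newline unless DOTALL is switched on with (?s).
pcre_plain  <- grepl("first.*second", x, perl = TRUE)
pcre_dotall <- grepl("(?s)first.*second", x, perl = TRUE)

# A lookbehind is rejected by TRE at pattern compilation time;
# the error is caught here so that the script keeps running.
tre_lookbehind <- tryCatch(
  suppressWarnings(grepl("(?<=a)b", "ab")),
  error = function(e) "error"
)

print(tre_match)
print(c(pcre_plain = pcre_plain, pcre_dotall = pcre_dotall))
print(tre_lookbehind)
```

So the same pattern can silently match different text depending on the engine, which matters when switching existing code to perl = TRUE.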
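Before timing anything, it may be worth checking that a pattern really returns the same result with both engines. The helper below is a hypothetical sketch (same_result and the sample strings are not part of the original benchmark):

```r
# Hypothetical helper: TRUE when sub() gives identical output with TRE and PCRE.
same_result <- function(pattern, replacement, x) {
  identical(
    sub(pattern, replacement, x),              # TRE (default engine)
    sub(pattern, replacement, x, perl = TRUE)  # PCRE
  )
}

# On a single-line string both engines extract the last parenthesized text.
single_line <- "alpha (one) beta (two) gamma"
ok_single <- same_result(".*\\((.*)\\).*", "\\1", single_line)

# With a line break in the input the results diverge, because "." matches
# the newline in TRE but not in PCRE.
multi_line <- "a (x)\nb"
ok_multi <- same_result(".*\\((.*)\\).*", "\\1", multi_line)

print(c(single_line = ok_single, multi_line = ok_multi))
```

Only patterns and inputs that pass such a check give a like-for-like timing comparison.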

Performance tests on Windows 7, Linux Ubuntu 16.04, MacOS Sierra 10.12.6

I mainly use R on Windows, but I installed R 3.2.3 on a Linux virtual machine specifically for this testing. MacOS results are taken from t.kalinowski's answer.

Let's compare the performance of the TRE (default) and PCRE ( perl=TRUE ) regex engines using the microbenchmark library (see more benchmarking options in R ):

 library(microbenchmark) 

The text is taken from the Wikipedia article on butterflies.

 txt <- "Butterflies are insects in the macrolepidopteran clade Rhopalocera from the order Lepidoptera, which also includes moths. Adult butterflies have large, often brightly coloured wings, and conspicuous, fluttering flight. The group comprises the large superfamily Papilionoidea, which contains at least one former group, the skippers (formerly the superfamily \"Hesperioidea\") and the most recent analyses suggest it also contains the moth-butterflies (formerly the superfamily \"Hedyloidea\"). Butterfly fossils date to the Paleocene, which was about 56 million years ago." 

Let's try extracting the last parenthesized text with sub , a very common operation in R:

    # sub('.*\\((.*)\\).*', '\\1', txt)
    # => [1] "formerly the superfamily \"Hedyloidea\""

    # Each wrapper uses its own argument rather than the global txt.
    PCRE_1 <- function(text) { return(sub('.*\\((.*)\\).*', '\\1', text, perl=TRUE)) }
    TRE_1 <- function(text) { return(sub('.*\\((.*)\\).*', '\\1', text)) }

    test <- microbenchmark( PCRE_1(txt), TRE_1(txt), times = 500000 )
    test

The results are as follows:

    WINDOWS
    -------
    Unit: microseconds
            expr     min      lq      mean  median      uq       max neval
     PCRE_1(txt) 163.607 165.418 168.65393 166.625 167.229  7314.588 5e+05
      TRE_1(txt)  70.031  72.446  74.53842  73.050  74.257 38026.680 5e+05

    MacOS
    -----
    Unit: microseconds
            expr    min     lq     mean median     uq       max neval
     PCRE_1(txt) 31.693 32.857 37.00757 33.413 35.805 43810.177 5e+05
      TRE_1(txt) 46.037 47.199 53.06407 47.807 51.981  7702.869 5e+05

    Linux
    ------
    Unit: microseconds
            expr    min     lq     mean median     uq       max neval
     PCRE_1(txt) 10.557 11.555 13.78216 12.097 12.662  4301.178 5e+05
      TRE_1(txt) 25.875 27.350 31.51925 27.805 28.737 17974.716 5e+05

TRE regex sub wins only on Windows , where it is more than twice as fast. On MacOS and Linux, the PCRE version ( perl=TRUE ) wins: by about 1.4x on MacOS and about 2.3x on Linux (comparing medians).

Now, let's compare the performance of patterns that do not rely heavily on backtracking, and extract the words inside double quotes:

    # regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt))
    # => [1] "\"Hesperioidea\"" "\"Hedyloidea\""

    # Each wrapper uses its own argument rather than the global txt.
    PCRE_2 <- function(text) { return(regmatches(text, gregexpr("\"[A-Za-z]+\"", text, perl=TRUE))) }
    TRE_2 <- function(text) { return(regmatches(text, gregexpr("\"[A-Za-z]+\"", text))) }

    test <- microbenchmark( PCRE_2(txt), TRE_2(txt), times = 500000 )
    test

    WINDOWS
    -------
    Unit: microseconds
            expr     min      lq     mean  median      uq       max neval
     PCRE_2(txt) 324.799 330.232 349.0281 332.646 336.269 124404.14 5e+05
      TRE_2(txt) 187.755 191.981 204.7663 193.792 196.208  74554.94 5e+05

    MacOS
    -----
    Unit: microseconds
            expr    min     lq     mean median     uq      max neval
     PCRE_2(txt) 63.801 68.115 75.51773 69.164 71.219 47686.40 5e+05
      TRE_2(txt) 63.825 67.849 75.20246 68.883 70.933 49691.92 5e+05

    LINUX
    -----
    Unit: microseconds
            expr    min     lq     mean median     uq     max neval
     PCRE_2(txt) 30.199 34.750 44.05169 36.151 43.403 38428.2 5e+05
      TRE_2(txt) 37.752 41.854 52.58230 43.409 51.781 38915.7 5e+05

On average, PCRE is faster on Linux, the difference on MacOS is negligible, and on Windows TRE is much faster.

Summary

In these tests, the TRE regex library (the default) is clearly much faster on Windows . On Linux , PCRE regex is faster. On MacOS , PCRE regex is still preferable because, with patterns that backtrack heavily, it is faster than TRE on this OS.
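Since the outcome depends on the platform, a quick check on your own machine can be more informative than the numbers above. Below is a rough sketch that uses only base R ( system.time ) instead of microbenchmark; the function name compare_engines, the repetition count, and the sample string are assumptions made for this example, and the timings are much coarser than microbenchmark's.

```r
# Hypothetical helper: average time per sub() call, in microseconds,
# for the default TRE engine and for PCRE (perl = TRUE).
compare_engines <- function(pattern, replacement, x, reps = 2000L) {
  time_engine <- function(use_perl) {
    elapsed <- system.time(
      for (i in seq_len(reps)) sub(pattern, replacement, x, perl = use_perl)
    )[["elapsed"]]
    elapsed / reps * 1e6  # seconds for all reps -> microseconds per call
  }
  c(TRE = time_engine(FALSE), PCRE = time_engine(TRUE))
}

# Sample input invented for this sketch; substitute your own text and pattern.
timings <- compare_engines(".*\\((.*)\\).*", "\\1", "alpha (one) beta (two) gamma")
print(timings)
```

For anything beyond a rough impression, microbenchmark with a realistic input is the better tool, as in the tests above.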


Running @wiktor-stribiżew's tests, I get different results from his. In the first test, the PCRE engine is faster than TRE (i.e. perl=TRUE is faster). In the second benchmark, there is no meaningful difference in performance between PCRE and TRE.

They were run on R version 3.4.2 (2017-09-28), macOS Sierra 10.12.6, i7-2675QM CPU @ 2.20GHz

```
txt <- "Butterflies are insects in the macrolepidopteran clade Rhopalocera from the order Lepidoptera, which also includes moths. Adult butterflies have large, often brightly coloured wings, and conspicuous, fluttering flight. The group comprises the large superfamily Papilionoidea, which contains at least one former group, the skippers (formerly the superfamily \"Hesperioidea\") and the most recent analyses suggest it also contains the moth-butterflies (formerly the superfamily \"Hedyloidea\"). Butterfly fossils date to the Paleocene, which was about 56 million years ago."

library(microbenchmark)

# Each wrapper uses its own argument rather than the global txt.
PCRE_1 <- function(text) sub('.*\\((.*)\\).*', '\\1', text, perl=TRUE)
TRE_1 <- function(text) sub('.*\\((.*)\\).*', '\\1', text)
(test <- microbenchmark( PCRE_1(txt), TRE_1(txt), times = 500000 ))
#> Unit: microseconds
#>         expr    min     lq     mean median     uq       max neval
#>  PCRE_1(txt) 31.693 32.857 37.00757 33.413 35.805 43810.177 5e+05
#>   TRE_1(txt) 46.037 47.199 53.06407 47.807 51.981  7702.869 5e+05

PCRE_2 <- function(text) regmatches(text, gregexpr("\"[A-Za-z]+\"", text, perl=TRUE))
TRE_2 <- function(text) regmatches(text, gregexpr("\"[A-Za-z]+\"", text))
(test <- microbenchmark( PCRE_2(txt), TRE_2(txt), times = 500000 ))
#> Unit: microseconds
#>         expr    min     lq     mean median     uq      max neval
#>  PCRE_2(txt) 63.801 68.115 75.51773 69.164 71.219 47686.40 5e+05
#>   TRE_2(txt) 63.825 67.849 75.20246 68.883 70.933 49691.92 5e+05
```

My results on Ubuntu 16.04: PCRE ( perl=TRUE ) is faster, see below.

    Unit: microseconds
            expr    min     lq  mean median    uq    max neval cld
     PCRE_1(txt)  8.949  9.809 11.16  10.18 10.62 135299 5e+05  a
      TRE_1(txt) 23.816 24.805 26.84  25.23 26.17   5433 5e+05   b

    Unit: microseconds
            expr   min    lq  mean median    uq    max neval cld
     PCRE_2(txt) 26.97 30.96 37.32  32.19 35.06 243164 5e+05  a
      TRE_2(txt) 33.75 38.07 44.50  39.40 43.33  35632 5e+05   b

    Session info -----------------------------------------------------------------
     setting  value
     version  R version 3.4.2 (2017-09-28)
     system   x86_64, linux-gnu
     ui       RStudio (1.1.383)
     language en
     collate  en_US.UTF-8
     tz       Europe/Berlin
     date     2017-11-12

    Linux 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

    model name : Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
    stepping   : 3
    microcode  : 0x9
    cpu MHz    : 3647.929
    cache size : 8192 KB

Source: https://habr.com/ru/post/1273285/

