Fastest way to check if a string matches a regex in Ruby?

What is the fastest way to check if a string matches a regex in Ruby?

My problem is that I have to "egrep" through a huge list of strings to find the ones that match a regexp given at runtime. I only care whether a string matches the regexp, not where it matches or what the matching groups contain. I hope this assumption can be used to reduce the time my code spends matching regexps.

I load the regexp with

pattern = Regexp.new(ptx).freeze 

I found that string =~ pattern is slightly faster than string.match(pattern).

Are there other tricks or shortcuts that can be used to speed up this test?

+75
performance ruby regex
Aug 09 '12
7 answers

Starting with Ruby 2.4.0, you can use Regexp#match?:

 pattern.match?(string) 

Regexp#match? is explicitly listed as a performance enhancement in the release notes for 2.4.0, because it avoids the object allocations performed by other methods such as Regexp#match and =~:

Regexp#match?
Added Regexp#match?, which executes a regexp match without creating a back reference object and changing $~, to reduce object allocation.
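
For the use case in the question (filtering a large list against a pattern supplied at runtime), a minimal sketch might look like the following. The values of `ptx` and `strings` are placeholders made up for illustration, and the code assumes Ruby 2.4 or later.

```ruby
# Hypothetical inputs: `ptx` stands in for the pattern text supplied at
# runtime, and `strings` for the large list to filter.
ptx = 'a+b'
strings = %w[aacaabc xyz ab ba]

pattern = Regexp.new(ptx)

# Regexp#match? returns only true or false; it builds no MatchData
# and leaves $~ untouched.
matching = strings.select { |s| pattern.match?(s) }

p matching  # the strings that match the pattern
```

`strings.grep(pattern)` selects the same elements; which of the two is faster on a given Ruby version is something to measure rather than assume.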

+79
Mar 16 '17 at 12:30

This is a simple test:

    require 'benchmark'

    "test123" =~ /1/
    => 4
    Benchmark.measure{ 1000000.times { "test123" =~ /1/ } }
    =>   0.610000   0.000000   0.610000 (  0.578133)

    "test123"[/1/]
    => "1"
    Benchmark.measure{ 1000000.times { "test123"[/1/] } }
    =>   0.718000   0.000000   0.718000 (  0.750010)

    "test123".match(/1/)
    => #<MatchData "1">
    Benchmark.measure{ 1000000.times { "test123".match(/1/) } }
    =>   1.703000   0.000000   1.703000 (  1.578146)

So =~ is faster, but it depends on what you want as the return value. If you only want to check whether the text matches a regular expression, use =~.

+70
Aug 09 '12 at 17:03

This is the benchmark I ran after finding some articles around the net.

Since 2.4.0, the winner is re.match?(str) (as suggested by @wiktor-stribiżew). In earlier versions, re =~ str seems to be the fastest, although str =~ re is almost as fast.

    #!/usr/bin/env ruby
    require 'benchmark'

    str = "aacaabc"
    re = Regexp.new('a+b').freeze

    N = 4_000_000

    Benchmark.bm do |b|
      b.report("str.match re\t") { N.times { str.match re } }
      b.report("str =~ re\t")    { N.times { str =~ re } }
      b.report("str[re] \t")     { N.times { str[re] } }
      b.report("re =~ str\t")    { N.times { re =~ str } }
      b.report("re.match str\t") { N.times { re.match str } }
      if re.respond_to?(:match?)
        b.report("re.match? str\t") { N.times { re.match? str } }
      end
    end

MRI 1.9.3-p551 results:

    $ ./bench-re.rb | sort -t $'\t' -k 2
                         user     system      total        real
    re =~ str        2.390000   0.000000   2.390000 (  2.397331)
    str =~ re        2.450000   0.000000   2.450000 (  2.446893)
    str[re]          2.940000   0.010000   2.950000 (  2.941666)
    re.match str     3.620000   0.000000   3.620000 (  3.619922)
    str.match re     4.180000   0.000000   4.180000 (  4.180083)

MRI 2.1.5 Results:

    $ ./bench-re.rb | sort -t $'\t' -k 2
                         user     system      total        real
    re =~ str        1.150000   0.000000   1.150000 (  1.144880)
    str =~ re        1.160000   0.000000   1.160000 (  1.150691)
    str[re]          1.330000   0.000000   1.330000 (  1.337064)
    re.match str     2.250000   0.000000   2.250000 (  2.255142)
    str.match re     2.270000   0.000000   2.270000 (  2.270948)

MRI 2.3.3 results (there seems to be a regression in regular expression matching):

    $ ./bench-re.rb | sort -t $'\t' -k 2
                         user     system      total        real
    re =~ str        3.540000   0.000000   3.540000 (  3.535881)
    str =~ re        3.560000   0.000000   3.560000 (  3.560657)
    str[re]          4.300000   0.000000   4.300000 (  4.299403)
    re.match str     5.210000   0.010000   5.220000 (  5.213041)
    str.match re     6.000000   0.000000   6.000000 (  6.000465)

MRI 2.4.0 Results:

    $ ./bench-re.rb | sort -t $'\t' -k 2
                         user     system      total        real
    re.match? str    0.690000   0.010000   0.700000 (  0.682934)
    re =~ str        1.040000   0.000000   1.040000 (  1.035863)
    str =~ re        1.040000   0.000000   1.040000 (  1.042963)
    str[re]          1.340000   0.000000   1.340000 (  1.339704)
    re.match str     2.040000   0.000000   2.040000 (  2.046464)
    str.match re     2.180000   0.000000   2.180000 (  2.174691)

+39
Aug 10 '12 at 19:41

How about re === str (case comparison)?

Since it evaluates to true or false and has no need to store matches, return a match index, and so on, I wonder whether it would be an even faster way of matching than =~.




OK, I tested this. =~ is still faster, even if you have several capture groups; however, === is faster than the other options.

By the way, what good is freeze? I could not measure any performance gain from it.
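
To make the comparison concrete, here is a small sketch of what each operator returns. `Regexp#===` yields true or false (it is what `case`/`when` calls), while `=~` yields the match index or nil. The sample strings and the `classify` helper are made up for illustration.

```ruby
re = /(\d+)-(\d+)/

# === returns a plain boolean...
p re === 'order 12-34'   # true
p re === 'no digits'     # false

# ...while =~ returns the index where the match starts, or nil.
p re =~ 'order 12-34'    # 6
p re =~ 'no digits'      # nil

# case/when calls === on each pattern in turn.
# `classify` is a hypothetical helper, not part of any library.
def classify(line)
  case line
  when /\A\s*\z/ then :blank
  when /\A#/     then :comment
  else                :data
  end
end

p classify('# note')  # :comment
```

Note that `===` still sets `$~` (which is why `$1` is usable inside a `when` branch), so it does not skip the MatchData allocation the way `match?` does.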

+7
May 10 '13 at 21:23

Depending on how complex your regular expression is, you might be able to use plain string slicing instead. I'm not sure whether this is practical for your application or whether it would actually offer any speed improvement.

    'testsentence'['stsen']
    => 'stsen' # evaluates to true
    'testsentence'['koala']
    => nil # evaluates to false

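
Building on this idea, one possible sketch: if the pattern supplied at runtime happens to contain no regex metacharacters, a plain substring test can stand in for the regex. The helper name `build_matcher` is hypothetical, and comparing against `Regexp.escape` is just one conservative way to detect a literal pattern; whether this is actually faster should be benchmarked.

```ruby
# Hypothetical helper: returns a lambda that tests one string.
# Falls back to Regexp#match? (Ruby 2.4+) when the pattern is not a literal.
def build_matcher(ptx)
  if Regexp.escape(ptx) == ptx
    # Nothing needed escaping, so a substring search gives the same answer.
    ->(s) { s.include?(ptx) }
  else
    re = Regexp.new(ptx)
    ->(s) { re.match?(s) }
  end
end

literal = build_matcher('stsen')   # takes the substring path
regex   = build_matcher('st.*n')   # takes the regex path

p literal.call('testsentence')  # true
p literal.call('koala')         # false
p regex.call('testsentence')    # true
```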
+4
Aug 09 '12 at 15:56

I am wondering if there is any weird way to make this check even faster, perhaps using some weird method in Regexp or some weird construct.

Regexp engines differ in how they implement searches, but in general, anchor your patterns for speed and avoid greedy matches, especially when searching long strings.

The best thing to do, until you are familiar with how a particular engine works, is to run benchmarks and add or remove anchors, try limiting searches, compare wildcards against explicit matches, and so on.

The Fruity gem is very useful for quickly benchmarking things because it is smart. Ruby's built-in Benchmark module is also useful, although you can write tests that fool you if you are not careful.

I have used both in many answers here on Stack Overflow, so you can search through my answers and find plenty of small tricks and results to give you ideas on how to write faster code.
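
As a rough illustration of the anchoring advice, the sketch below times an unanchored and an anchored version of the same pattern against a long string whose only candidate match sits at the very end. The pattern, the input, and the `time_it` helper are all made up for this example, and the actual numbers depend on the Ruby version and the engine's optimizations.

```ruby
# Hypothetical helper: wall-clock seconds for running the block n times.
def time_it(n)
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  n.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

haystack   = ('x' * 10_000) + 'foo42'  # the only candidate match is at the end
unanchored = /foo\d+/                  # may be tried at every position
anchored   = /\Afoo\d+/                # can only match at the very start

# The two patterns answer different questions, so anchor only when
# the match really has to begin at the start of the string.
p unanchored.match?(haystack)  # true
p anchored.match?(haystack)    # false

n = 1_000
puts format('unanchored: %.4fs', time_it(n) { unanchored.match?(haystack) })
puts format('anchored:   %.4fs', time_it(n) { anchored.match?(haystack) })
```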
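
For the built-in route, a minimal sketch using `Benchmark.bmbm` from the standard library is shown below. It runs a rehearsal pass before the measured pass, which helps reduce (but does not eliminate) misleading results caused by warm-up and garbage collection. The sample data, pattern, and iteration count are arbitrary choices for illustration.

```ruby
require 'benchmark'

# Arbitrary sample data for illustration only.
strings = Array.new(1_000) { |i| "item-#{i}" }
re = /-\d*7\z/
n = 100

# Sanity check first: comparing variants that return different answers
# is one of the easiest ways to fool yourself with a benchmark.
with_op    = strings.select { |s| re =~ s }
with_match = strings.select { |s| re.match(s) }
raise 'variants disagree' unless with_op == with_match

Benchmark.bmbm do |b|
  b.report('re =~ str')    { n.times { strings.each { |s| re =~ s } } }
  b.report('re.match str') { n.times { strings.each { |s| re.match(s) } } }
  if re.respond_to?(:match?)  # only available from Ruby 2.4
    b.report('re.match? str') { n.times { strings.each { |s| re.match?(s) } } }
  end
end
```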

The most important thing to remember is that it is bad to prematurely optimize your code before you know where the slowdown occurs.

+3
03 Oct '14 at 23:51

To complement the answers of Wiktor Stribiżew and Dougie, I would say that /regex/.match?("string") is about as fast as "string".match?(/regex/).

Ruby 2.4.0 (10,000,000 iterations, ~2 sec)

    2.4.0 > require 'benchmark'
    => true
    2.4.0 > Benchmark.measure{ 10000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
    => #<Benchmark::Tms:0x005563da1b1c80 @label="", @real=2.2060338060000504, @cstime=0.0, @cutime=0.0, @stime=0.04000000000000001, @utime=2.17, @total=2.21>
    2.4.0 > Benchmark.measure{ 10000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
    => #<Benchmark::Tms:0x005563da139eb0 @label="", @real=2.260814556000696, @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, @utime=2.2500000000000004, @total=2.2600000000000007>

Ruby 2.6.2 (100,000,000 iterations, ~20 sec)

    irb(main):001:0> require 'benchmark'
    => true
    irb(main):005:0> Benchmark.measure{ 100000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
    => #<Benchmark::Tms:0x0000562bc83e3768 @label="", @real=24.60139879199778, @cstime=0.0, @cutime=0.0, @stime=0.010000999999999996, @utime=24.565644999999996, @total=24.575645999999995>
    irb(main):004:0> Benchmark.measure{ 100000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
    => #<Benchmark::Tms:0x0000562bc846aee8 @label="", @real=24.634255946999474, @cstime=0.0, @cutime=0.0, @stime=0.010046, @utime=24.598276, @total=24.608321999999998>

Note: the timings vary. Sometimes /regex/.match?("string") is faster and sometimes "string".match?(/regex/) is; the differences may simply be due to other activity on the machine.

0
Dec 04 '17 at 22:08