Regex vs brute force for small strings

Question


When testing small strings (for example, isPhoneNumber or isHexadecimal), is there a performance advantage to using regular expressions, or would brute-forcing them be faster? Wouldn't brute-forcing them, by simply checking whether the characters of the given string fall within a given range, be faster than using a regular expression?

For instance:

    public static boolean isHexadecimal(String value) {
        if (value.startsWith("-")) {
            value = value.substring(1);
        }
        value = value.toLowerCase();
        if (value.length() <= 2 || !value.startsWith("0x")) {
            return false;
        }
        for (int i = 2; i < value.length(); i++) {
            char c = value.charAt(i);
            if (!(c >= '0' && c <= '9' || c >= 'a' && c <= 'f')) {
                return false;
            }
        }
        return true;
    }

vs.

 Regex.match(/0x[0-9a-f]+/, "0x123fa") // returns true if regex matches whole given expression

It seems that there would be some overhead associated with a regular expression, even if the pattern is precompiled, simply because regular expressions have to work in many general cases. In contrast, the brute-force method does exactly what is required and nothing more. Am I missing some optimization that regular expressions have?

+5

performance string regex brute-force

Braden Steffaniak Oct 23 '16 at 18:26


8 answers

Checking whether a string's characters fall within a specific range is exactly what regular expressions are built to do. They transform the expression into an atomic series of instructions; they essentially write out your manual steps, but at a lower level.

What tends to be slow with regular expressions is the conversion of the expression into instructions. You can see a real performance boost when the regex is used more than once. This is when you can compile an expression ahead of time, and then simply apply the resulting compiled instructions to match, search, replace, etc.
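To illustrate the difference between compiling on every call and compiling once, here is a minimal Java sketch. The class and method names are hypothetical, and the pattern (optional minus sign, case-insensitive) is an assumption chosen to mirror the hand-written method in the question.

```java
import java.util.regex.Pattern;

final class HexRegex {
    // Compiled once when the class is loaded, then reused for every call.
    private static final Pattern HEX =
            Pattern.compile("-?0x[0-9a-f]+", Pattern.CASE_INSENSITIVE);

    // String.matches compiles the pattern again on every call.
    static boolean isHexUncompiled(String s) {
        return s.matches("(?i)-?0x[0-9a-f]+");
    }

    // Only the matching step runs here; the compilation cost was paid once.
    static boolean isHexPrecompiled(String s) {
        return HEX.matcher(s).matches();
    }
}
```

Both methods accept and reject the same inputs; they differ only in where the compilation cost is paid.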

As always with performance, run some tests and measure the results.

+8

Soviut Oct 23 '16 at 18:34


The brute-force method of solving a problem consists of systematically testing all combinations. That is not what you are doing here.

You can get better performance from a hand-written procedure. You can take advantage of the data distribution if you know it beforehand. Or you can make some smart shortcuts that apply to your case. But it is by no means guaranteed that what you write will automatically be faster than the regular expression. Regex implementations are optimized too, and you can easily end up with code that is worse.
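As one illustration of such a shortcut (a sketch only, with hypothetical names, and no claim that it beats the range comparisons on any given JVM), the hex-digit test can be turned into a single table lookup:

```java
final class HexTable {
    // One entry per ASCII code point; true for '0'-'9', 'a'-'f' and 'A'-'F'.
    private static final boolean[] IS_HEX = new boolean[128];

    static {
        for (char c = '0'; c <= '9'; c++) IS_HEX[c] = true;
        for (char c = 'a'; c <= 'f'; c++) IS_HEX[c] = true;
        for (char c = 'A'; c <= 'F'; c++) IS_HEX[c] = true;
    }

    // Non-ASCII characters are rejected before the table is consulted.
    static boolean isHexDigit(char c) {
        return c < 128 && IS_HEX[c];
    }

    // True if s has at least one character at or after index `from`
    // and all of them are hex digits.
    static boolean allHexDigits(String s, int from) {
        if (from >= s.length()) {
            return false;
        }
        for (int i = from; i < s.length(); i++) {
            if (!isHexDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Whether this is actually faster has to be measured; the JIT may already compile the plain comparisons into something equally cheap.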

The code in your question is nothing special and will most likely be on a par with the regular expression. When I tested it there was no clear winner: sometimes one was faster, sometimes the other, and the difference was small. Your time is limited, so think carefully about where you spend it.

+4

Antonín Lejsek Oct 23 '16 at 10:58


You are misusing the term brute force. A better term is custom matching.

Regular expression interpreters are usually slower than custom matchers. The regular expression is compiled into byte code, and compilation takes time. Even ignoring compilation (which may be fine if you compile only once and match a very long string and/or many times, so that the compilation cost is not important), the machine instructions spent in the matching interpreter are overhead that the custom matcher does not have.

In cases where the regex matcher wins, it is usually because the regex engine is implemented in very fast native code, while the custom matcher is written in something slower.

Now, you can compile regular expressions into native code that runs as fast as well-written custom matchers. This is the approach taken by, for example, lex/flex and others. But the most common libraries and built-in language facilities do not use this approach (Java, Python, Perl, etc.). They use interpreters.

Native code-generating libraries tend to be cumbersome to use, with the possible exception of C/C++, where they have been part of the air for decades.

In other languages, I am a fan of state machines. To me they are easier to understand and get right than either regular expressions or custom matchers. Below is one for your problem (diagram not shown). State 0 is the start state, and D stands for a hex digit.

An implementation of the machine can be very fast. In Java, it might look like this:

    static boolean isHex(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (state) {
                case 0: // start: optional '-' or the leading '0'
                    if (c == '-') state = 1;
                    else if (c == '0') state = 2;
                    else return false;
                    break;
                case 1: // after '-': expect '0'
                    if (c == '0') state = 2;
                    else return false;
                    break;
                case 2: // after '0': expect 'x'
                    if (c == 'x') state = 3;
                    else return false;
                    break;
                case 3: // after "0x": expect the first hex digit
                    if (isHexDigit(c)) state = 4;
                    else return false;
                    break;
                case 4: // further hex digits keep us in state 4
                    if (!isHexDigit(c)) return false;
                    break;
            }
        }
        // Accept only if the input ended in the final state; otherwise
        // inputs such as "" or "0x" would be reported as valid.
        return state == 4;
    }

    static boolean isHexDigit(char c) {
        return '0' <= c && c <= '9' || 'A' <= c && c <= 'F' || 'a' <= c && c <= 'f';
    }

The code is not very short, but it is a direct translation of the diagram. There is not much to get wrong other than simple typing errors.

In C, you can implement states as goto labels:

    #include <ctype.h>

    int isHex(const char *s) {
        char c;
    s0: c = *s++;
        if (c == '-') goto s1;
        if (c == '0') goto s2;
        return 0;
    s1: c = *s++;
        if (c == '0') goto s2;
        return 0;
    s2: c = *s++;
        if (c == 'x') goto s3;
        return 0;
    s3: c = *s++;
        /* cast avoids undefined behavior for negative char values */
        if (isxdigit((unsigned char)c)) goto s4;
        return 0;
    s4: c = *s++;
        if (isxdigit((unsigned char)c)) goto s4;
        if (c == '\0') return 1;
        return 0;
    }

This kind of goto matcher, written in C, is usually the fastest I have seen. On my MacBook, using an old gcc (4.6.4), this one compiles to only 35 machine instructions.

+4

Gene Nov 05 '16 at 4:37


What is best usually depends on your goals. If readability is the main goal (which it should be unless you have found a performance problem), then the regex is just fine.

If performance is your goal, you first need to analyze the problem. For instance, if you know the input is either a phone number or a hexadecimal number (and nothing else), then the problem becomes much simpler.

Now let's look at your function (in terms of performance) for determining hexadecimal numbers:

Taking a substring is bad (in general it creates a new object); it is better to work with an index and advance it.
Instead of using toLowerCase(), it is better to compare against both upper- and lower-case letters (the string is traversed only once, no unnecessary replacements are made, and no new object is created).

Thus, a performance-optimized version might look something like this (perhaps you can optimize it using charArray instead of a string):

    public static final boolean isHexadecimal(String value) {
        if (value.length() < 3)
            return false;
        int idx;
        if (value.charAt(0) == '-' || value.charAt(0) == '+') { // also supports unary plus
            if (value.length() < 4) // necessary because -0x and +0x are not valid
                return false;
            idx = 1;
        } else {
            idx = 0;
        }
        if (value.charAt(idx) != '0' || value.charAt(idx + 1) != 'x')
            return false;
        for (idx += 2; idx < value.length(); idx++) {
            char c = value.charAt(idx);
            if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')))
                return false;
        }
        return true;
    }

+1

maraca Oct 30 '16 at 13:29


Well-implemented regular expressions can be faster than a naive brute-force implementation of the same pattern. On the other hand, you can always implement a faster solution for a specific case. Also, as indicated in the article above, most implementations in popular languages are inefficient (in some cases).

I would implement my own solutions only when performance is an absolute priority and with extensive testing and profiling.

0

Argb32 Oct 31 '16 at 14:50


To get better performance than a naive hand-written check, you can use a regex library based on deterministic automata, for example the Brics automaton library.

I wrote a short jmh test:

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public abstract class MatcherBenchmark {

        private String longHexText;

        @Setup
        public void setup() {
            initPattern("0x[0-9a-fA-F]+");
            this.longHexText = "0x123fa";
        }

        public abstract void initPattern(String pattern);

        @Benchmark
        @BenchmarkMode(Mode.AverageTime)
        @OutputTimeUnit(TimeUnit.MICROSECONDS)
        @Warmup(iterations = 10)
        @Measurement(iterations = 10)
        @Fork(1)
        public void benchmark() {
            boolean result = benchmark(longHexText);
            if (!result) {
                throw new RuntimeException();
            }
        }

        public abstract boolean benchmark(String text);

        @TearDown
        public void tearDown() {
            donePattern();
            this.longHexText = null;
        }

        public abstract void donePattern();
    }

and implemented it with:

    @Override
    public void initPattern(String pattern) {
        RegExp r = new RegExp(pattern);
        this.automaton = new RunAutomaton(r.toAutomaton(true));
    }

    @Override
    public boolean benchmark(String text) {
        return automaton.run(text);
    }

I also created benchmarks for zeppelin's solution, Gene's solution, and a compiled java.util.regex solution, as well as a solution using rexlex. These are the results of the jmh test on my machine:

    Benchmark                            Mode  Cnt  Score   Error  Units
    BricsMatcherBenchmark.benchmark      avgt   10  0,014 ± 0,001  us/op
    GenesMatcherBenchmark.benchmark      avgt   10  0,017 ± 0,001  us/op
    JavaRegexMatcherBenchmark.benchmark  avgt   10  0,097 ± 0,005  us/op
    RexlexMatcherBenchmark.benchmark     avgt   10  0,061 ± 0,002  us/op
    ZeppelinsBenchmark.benchmark         avgt   10  0,008 ± 0,001  us/op

Running the same test with a non-hexadecimal input, 0x123fax, leads to the following results (note: I inverted the check in benchmark for this test):

    Benchmark                            Mode  Cnt  Score   Error  Units
    BricsMatcherBenchmark.benchmark      avgt   10  0,015 ± 0,001  us/op
    GenesMatcherBenchmark.benchmark      avgt   10  0,019 ± 0,001  us/op
    JavaRegexMatcherBenchmark.benchmark  avgt   10  0,102 ± 0,001  us/op
    RexlexMatcherBenchmark.benchmark     avgt   10  0,052 ± 0,002  us/op
    ZeppelinsBenchmark.benchmark         avgt   10  0,009 ± 0,001  us/op
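For readers unfamiliar with automaton-based matching: a deterministic automaton reads each character exactly once and never backtracks. The sketch below is a hand-built transition table for -?0x[0-9a-fA-F]+. It is not the Brics API and not how that library is implemented internally; the class name, state numbering, and character classes are all assumptions made for illustration.

```java
final class HexDfa {
    private static final int DEAD = -1;

    // Character classes: 0 = '-', 1 = '0', 2 = 'x',
    // 3 = any other hex digit, 4 = anything else.
    private static int classOf(char c) {
        if (c == '-') return 0;
        if (c == '0') return 1;
        if (c == 'x') return 2;
        if ((c >= '1' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')) return 3;
        return 4;
    }

    // NEXT[state][characterClass] gives the next state, or DEAD.
    private static final int[][] NEXT = {
        //  '-'   '0'   'x'   hex   other
        {    1,    2, DEAD, DEAD, DEAD }, // 0: start
        { DEAD,    2, DEAD, DEAD, DEAD }, // 1: after '-'
        { DEAD, DEAD,    3, DEAD, DEAD }, // 2: after '0'
        { DEAD,    4, DEAD,    4, DEAD }, // 3: after "0x", first digit required
        { DEAD,    4, DEAD,    4, DEAD }, // 4: accepting, more digits allowed
    };

    static boolean matches(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            state = NEXT[state][classOf(s.charAt(i))];
            if (state == DEAD) {
                return false; // no transition: reject immediately
            }
        }
        return state == 4; // accept only in the final state
    }
}
```

A general-purpose library builds such a table from the pattern automatically; the per-character work at match time is a lookup of this kind.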

0

Corona Nov 05 '16 at 5:27


Regex has huge advantages, but Regex has performance issues.

-2

Vishnu ranganathan Oct 23 '16 at 18:59


zeppelin · Accepted Answer · 2016-11-03T11:39:48+0000

I wrote a small benchmark to compare the performance of:

A NOP method (to get an idea of the baseline iteration speed);
The original method provided by the OP;
RegExp;
Compiled RegExp;
The version provided by @maraca (without toLowerCase and substring);
A "fastIsHex" version (switch-based), which I added just for fun.

The test machine configuration is as follows:

JVM: Java (TM) SE runtime (version 1.8.0_101-b13)
Processor: Intel (R) Core (TM) i5-2500 CPU @ 3.30 GHz

And here are the results that I got for the original test string "0x123fa" and 10,000,000 iterations:

    Method "NOP" => #10000000 iterations in 9ms
    Method "isHexadecimal (OP)" => #10000000 iterations in 300ms
    Method "RegExp" => #10000000 iterations in 4270ms
    Method "RegExp (Compiled)" => #10000000 iterations in 1025ms
    Method "isHexadecimal (maraca)" => #10000000 iterations in 135ms
    Method "fastIsHex" => #10000000 iterations in 107ms

As you can see, even the OP's original method is faster than the RegExp method (at least when using the RegExp implementation provided by the JDK).

Benchmark code (for reference):

    public static void main(String[] argv) throws Exception {
        //Number of ITERATIONS
        final int ITERATIONS = 10000000;

        //NOP
        benchmark(ITERATIONS, "NOP", () -> nop(longHexText));

        //isHexadecimal
        benchmark(ITERATIONS, "isHexadecimal (OP)", () -> isHexadecimal(longHexText));

        //Un-compiled regexp
        benchmark(ITERATIONS, "RegExp", () -> longHexText.matches("0x[0-9a-fA-F]+"));

        //Pre-compiled regexp
        final Pattern pattern = Pattern.compile("0x[0-9a-fA-F]+");
        benchmark(ITERATIONS, "RegExp (Compiled)", () -> {
            pattern.matcher(longHexText).matches();
        });

        //isHexadecimal (maraca)
        benchmark(ITERATIONS, "isHexadecimal (maraca)", () -> isHexadecimalMaraca(longHexText));

        //FastIsHex
        benchmark(ITERATIONS, "fastIsHex", () -> fastIsHex(longHexText));
    }

    public static void benchmark(int iterations, String name, Runnable block) {
        //Start Time
        long stime = System.currentTimeMillis();

        //Benchmark
        for (int i = 0; i < iterations; i++) {
            block.run();
        }

        //Done
        System.out.println(
            String.format("Method \"%s\" => #%d iterations in %dms",
                name, iterations, (System.currentTimeMillis() - stime))
        );
    }

NOP Method:

    public static boolean nop(String value) {
        return true;
    }

fastIsHex method:

    public static boolean fastIsHex(String value) {

        //Value must be at least 4 characters long (0x00)
        if (value.length() < 4) {
            return false;
        }

        //Compute where the data starts
        int start = ((value.charAt(0) == '-') ? 1 : 0) + 2;

        //Check prefix
        if (value.charAt(start - 2) != '0' || value.charAt(start - 1) != 'x') {
            return false;
        }

        //Verify data
        for (int i = start; i < value.length(); i++) {
            switch (value.charAt(i)) {
                case '0': case '1': case '2': case '3': case '4':
                case '5': case '6': case '7': case '8': case '9':
                case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
                case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
                    continue;

                default:
                    return false;
            }
        }

        return true;
    }

So the answer is no: for short strings and the task at hand, RegExp is not faster.

When it comes to longer strings, the balance is quite different. Below are the results for a long (8192) hexadecimal string generated with:

 hexdump -n 8196 -v -e '/1 "%02X"' /dev/urandom

and 10,000 iterations:

    Method "NOP" => #10000 iterations in 2ms
    Method "isHexadecimal (OP)" => #10000 iterations in 1512ms
    Method "RegExp" => #10000 iterations in 1303ms
    Method "RegExp (Compiled)" => #10000 iterations in 1263ms
    Method "isHexadecimal (maraca)" => #10000 iterations in 553ms
    Method "fastIsHex" => #10000 iterations in 530ms

As you can see, the hand-written methods (maraca's and my fastIsHex) still beat RegExp, but the original method does not (because of the substring() and toLowerCase() calls).

Sidenote:

This test is very simple and only checks the “worst case” scenario (i.e. a fully valid string). Real-life results, with mixed data lengths and a non-zero ratio of invalid to valid inputs, can be quite different.
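To try a more realistic mix, one could generate inputs of varying length with a chosen share of invalid strings and feed the same list to every candidate. The following is a rough sketch under stated assumptions: the class, method names, and parameters are hypothetical, and invalid strings are produced simply by overwriting one digit with 'z'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

final class MixedInputs {
    // Builds `count` strings of the form 0x<1..maxDigits hex digits>.
    // Roughly `validRatio` of them stay valid; the rest get one digit corrupted.
    static List<String> generate(int count, int maxDigits, double validRatio, long seed) {
        Random rnd = new Random(seed); // fixed seed keeps runs comparable
        String digits = "0123456789abcdefABCDEF";
        List<String> out = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int len = 1 + rnd.nextInt(maxDigits);
            StringBuilder sb = new StringBuilder("0x");
            for (int j = 0; j < len; j++) {
                sb.append(digits.charAt(rnd.nextInt(digits.length())));
            }
            if (rnd.nextDouble() >= validRatio) {
                // Overwrite one digit position with a non-hex character.
                sb.setCharAt(2 + rnd.nextInt(len), 'z');
            }
            out.add(sb.toString());
        }
        return out;
    }

    // Runs the matcher over all inputs; returning the count keeps the
    // results in use so the loop is not trivially removable.
    static int countAccepted(List<String> inputs, Predicate<String> matcher) {
        int accepted = 0;
        for (String s : inputs) {
            if (matcher.test(s)) {
                accepted++;
            }
        }
        return accepted;
    }
}
```

Timing countAccepted for each matcher over the same generated list would then show how early rejections and short inputs shift the comparison.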

Update:

I also tried a char[] array version:

    char[] chars = value.toCharArray();
    for (idx += 2; idx < chars.length; idx++) {
        ...
    }

and it was even a little slower than the charAt(i) version:

    Method "isHexadecimal (maraca) char[] array version" => #10000000 iterations in 194ms
    Method "fastIsHex, char[] array version" => #10000000 iterations in 164ms

My guess is that this is due to the array copy performed internally by toCharArray().
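The copy is easy to observe: the Javadoc for String.toCharArray() specifies a newly allocated array, so two calls on the same string never return the same array object. A minimal sketch (the class and method names are hypothetical):

```java
final class CharArrayCopy {
    // Two calls on the same String yield two distinct array objects,
    // so every call pays for an allocation and a copy of all characters.
    static boolean returnsFreshArray(String s) {
        char[] first = s.toCharArray();
        char[] second = s.toCharArray();
        return first != second;
    }

    // Writing to the returned array does not affect the String.
    static boolean stringUnaffectedByWrite(String s) {
        char[] chars = s.toCharArray();
        chars[0] = (char) (chars[0] + 1); // assumes s is non-empty
        return s.charAt(0) != chars[0];
    }
}
```

For a 7-character string that copy is small, but it is pure overhead compared with reading the characters in place via charAt(i).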

Update (# 2):

I ran an additional 8k / 100,000-iteration test to see whether there is any real speed difference between the "maraca" and "fastIsHex" methods, and also normalized them to use exactly the same precondition code:

Run #1

    Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5341ms
    Method "fastIsHex" => #100000 iterations in 5313ms

Run #2

    Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5313ms
    Method "fastIsHex" => #100000 iterations in 5334ms

I.e., the speed difference between the two methods is marginal at best, and is probably due to measurement error (since I ran this on my workstation and not in a specially prepared clean test environment).
