Overheads / performance improvement using regular expressions

Question

If I need to check whether, for example, word A or word B occurs in a text (a String), is there a difference in performance between

if(text.contains(wordA) || text.contains(wordB))

and using a regular expression that searches the string?
Does it depend on the form of the regular expression?
Or is it just a matter of taste?

UPDATE:
If text.contains(wordA) is false, then text.contains(wordB) will be evaluated.
This means that contains will be called twice.

I thought that a regular expression might perform better than calling contains twice.

+4

java performance : string regex full-text-search

Cratylus Jan 20 '12 at 7:26

5 answers

The code you have expresses its intent clearly, is more readable than a regular expression, and is probably faster as well.

In any case, it is very unlikely that this part of your code will cause serious performance problems. So I would not worry about performance here, but about readability and maintainability.

+4

Jb nizet Jan 20 '12 at 7:35

While a regexp performs worse, it has more expressive power, and that is often more important. For instance,

  "performance".contains("form") // is true

this may not be what you mean by a "word". Instead, you can use a pattern such as

  "\\bform\\b"

This matches only the whole word, including when it sits at the very beginning or end of the string.
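
A minimal sketch of that difference, assuming the word starts and ends with letters or digits so that \b behaves as expected. The class and method names (WordMatch, containsWord) are made up for illustration:

```java
import java.util.regex.Pattern;

final class WordMatch {
    // Hypothetical helper: true if `word` occurs in `text` as a whole word.
    static boolean containsWord(String text, String word) {
        // Pattern.quote makes the word literal, so characters like '.' are not treated as regex syntax.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println("performance".contains("form"));           // true: substring match
        System.out.println(containsWord("performance", "form"));      // false: not a whole word
        System.out.println(containsWord("fill in the form", "form")); // true: whole word at the end
    }
}
```

If the same word is checked many times, the Pattern could be compiled once and reused rather than rebuilt on every call.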

+4

Peter Lawrey Jan 20 '12 at 7:45

Yes, there is a difference. contains does various array manipulations to find the word, while a regular expression uses different logic, so the performance will differ; it will even vary depending on how you write the regular expression.

Will it be significant? It is hard to say. But the most important thing to understand is this:

Write your code first, and do not worry about performance until you run into problems and profiling clearly shows that this check is the cause.

I would just use the contains method. But that opinion comes without any actual testing.

+3

Peter Jan 20 '12 at 7:37

In my opinion this is a matter of taste. Avoid premature optimization; see Practical rules for premature optimization.

As a general rule, if you are looking for a ~~word~~ substring rather than a pattern, do not use regular expressions.
For such a simple regular expression the performance difference in a text search will be small, so if you only run this search once in a while, it is not a performance issue. If you run it several thousand times or more in a loop and you have a performance problem, then measure it.
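
If you do end up measuring, a rough sketch could look like the following. The class name SearchTiming, the sample text, and the iteration count are all assumptions for illustration, and hand-rolled timing like this is easily distorted by JIT warm-up, so treat the numbers as a hint at best; a harness such as JMH is the more reliable route.

```java
import java.util.function.BooleanSupplier;
import java.util.regex.Pattern;

final class SearchTiming {
    // Runs `check` the given number of times and returns the elapsed nanoseconds.
    static long timeNanos(int iterations, BooleanSupplier check) {
        int hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (check.getAsBoolean()) {
                hits++; // use the result so the call is not trivially dead code
            }
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("hits: " + hits);
        return elapsed;
    }

    public static void main(String[] args) {
        String text = "some long text that mentions wordB near the end";
        Pattern pattern = Pattern.compile("wordA|wordB"); // compiled once, outside the loop
        int n = 100_000;

        // Warm-up runs so the JIT has a chance to compile both paths.
        timeNanos(n, () -> text.contains("wordA") || text.contains("wordB"));
        timeNanos(n, () -> pattern.matcher(text).find());

        long contains = timeNanos(n, () -> text.contains("wordA") || text.contains("wordB"));
        long regex = timeNanos(n, () -> pattern.matcher(text).find());
        System.out.println("contains: " + contains + " ns, regex: " + regex + " ns");
    }
}
```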

+2

stema Jan 20 '12 at 7:39

Joey · Accepted Answer · 2012-01-20T07:36:52+0000

In this trivial example you should not see much of a difference in performance, but purely from the algorithms involved, the regular expression

 wordA|wordB

really would be faster, because it makes just one pass through the string and uses a state machine to match either of the two substrings. However, this is offset by first having to construct the state machine, which in this case should be roughly linear in the length of the regular expression. You can compile the regular expression up front so that this cost is paid only once for as long as the compiled object lives.

Thus, the cost essentially comes down to:

a linear search over the string twice (2 × string length)
or a linear search over the string once plus building a DFA (string length + length of the regular expression)

If your text is very large and the substrings are very small, it may be worth it.
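
The two options could be sketched like this, with the regular expression compiled once and reused. The class and method names (EitherWord, viaRegex, viaContains) are only for illustration:

```java
import java.util.regex.Pattern;

final class EitherWord {
    // Compiled once; the construction cost is paid only when the class is loaded.
    private static final Pattern A_OR_B = Pattern.compile("wordA|wordB");

    // One scan of the text with the precompiled alternation.
    static boolean viaRegex(String text) {
        return A_OR_B.matcher(text).find();
    }

    // Up to two scans: the second contains call runs only if the first finds nothing.
    static boolean viaContains(String text) {
        return text.contains("wordA") || text.contains("wordB");
    }

    public static void main(String[] args) {
        String text = "a text that mentions wordB somewhere";
        System.out.println(viaRegex(text));    // true
        System.out.println(viaContains(text)); // true
    }
}
```

As far as I know, java.util.regex is a backtracking matcher rather than a DFA-based one, so the single-pass argument is an algorithmic one and says little about what you will actually measure in Java.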

However, you are most likely optimizing in the wrong place. Use a profiler to find the actual bottlenecks in the code and optimize those; never worry about such trivial "optimizations" unless you can prove that they matter.

Finally, one more thing to consider: with a regular expression you can make sure that you actually match words (or word-like things) rather than parts of words, which may be the real reason to consider a regular expression instead of contains.
