Can I improve the performance of this regular expression further

Question

Can I improve the performance of this regular expression further

I am trying to find stream names from a stream dump file. Stream names are usually contained in double quotes on the first line of each stream dump. It might look so simple:

"THREAD1" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000]

Or as much as possible:

 "[STANDBY] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007ff71803a000 nid=0xd3e7 in Object.wait() [0x00007ff7f8ae1000]

The regular expression that I wrote is simple: "(.*)" . It captures everything inside double quotes as a group. However, this leads to a serious deviation, which requires many steps, as can be seen here . Verbally, we can explain this regex as “capture everything enclosed within double quotes as a group”

So, I came up with another regular expression that does the same thing: "([^\"])" . Verbally, we can describe this regular expression as “capture any number of characters without double quotes enclosed inside double quotes.” I did not find no quick regexp than this.It does not return anymore and therefore it requires minimal steps, as can be seen here .

I said it above to my colleague. He came up with another one: "(.*?)" . I did not understand how this works. It performs a significantly smaller rollback than the first, but slightly slower than the second, as can be seen here . but

I don’t understand why the rollback ends earlier.
I understand what ? - a quantifier that means once or not at all . However, I do not understand how once or not at all used here.
In fact, I cannot guess how we can describe this regular expression verbally.

My colleague tried to explain to me, but I still can not understand it completely. Can anyone explain?

+4

java performance regex

Mahesha999 Nov 23 '15 at 11:07

source share

1 answer

Wiktor stribiżew · Answer 1 · 2015-11-23T11:19:14+0000

Brief Description and Solution

The regular expression "(.*)" Includes a lot of backtracking, because it finds the first " , and then captures the entire line and backtracking, looking for " which is closest to the end of the line. Since you have a substring with a quote closer to the beginning, there is more backtracking than with "(.*?)" , Because this lazy quantifier *? causes the regular expression engine to search for the closest " after the first one found " .

A solution with a negative character class "([^"]*)" is the best of 3, because it does not need to capture all, only all characters except " . However, to stop any digression and make the expression ultimately effective, you can use possessive quantifiers.

If you need to match strings like " + no quotes here + " , use

 "([^"]*+)"

or even you don’t need to match the trailing quote in this situation:

 "([^"]*+)

Watch the regex demo

In fact, I cannot guess how we can describe this regular expression verbally.

The last "([^"]*+) regular expression can be described as

" - find the first character " left of the line
([^"]*+) - match and write to group 1 zero or more characters other than " as much as possible, and as soon as the engine finds a double quote, the match returns immediately, without backtracking.

Quantifiers

The difference between `.?` , `.` and `[^"]*+` quantifiers

The Greedy "(.*)" Solution works as follows: checks each character from left to right for a search " , and once found it captures the whole string to the end and checks each character if it is equal to " . Thus, in your input line it is accessed 160 times.

The Lazy "(.*?)" Solution works as follows: the engine finds the first " , and then moves along the template and tries the next token (which is " ) against T in THREAD1 . This will fail, so the engine backs off and allows .*? expand your correspondence to one element so that it matches T And again the engine advances in a pattern. Now he is trying to use " for H in THREAD1 . This fails, so the engine backs off and allows .*? expand and correspond to H Then the process repeats - the engine advances, fails, turns back, allows the lazy .*? expand its correspondence by one an element advances, fails, and so on, For each character matched with .*? engine must back off. From a computational point of view, this process of matching one element, advancing, rejecting, returning, expanding is "expensive."

Since the next one " just around the corner, the number of reverse tracking steps is much less than with greedy matching.

An attractive quantification solution with a negative character class "([^"]*+)" works as follows: the engine finds the leftmost one " , and then captures all the characters that are not " until the first " . A negative character class [^"]*+ greedily matches zero or more characters that are not double quotes. Therefore, we guarantee that a star-point will never jump over the first one it encounters. " This is a more direct and efficient way of matching between some delimiters. Please note that in this solution we can fully trust * , which quantifies [^"] . Although it is greedy, there is no risk that [^"] will match too much because it is mutually exclusive with " . This is the principle of contrast from the regex style guide [see Source] .

Please note that the possessive quantifier does not allow the regular expression to reverse movement into a subexpression, after matching, the characters between " become one hard block that cannot be re-sorted" due to some "inconvenience" encountered by the regular expression engine, and it will not be able to shift any characters from and to this block of text.

For the current expression, this is not a big deal.

Can I improve the performance of this regular expression further

Brief Description and Solution

Quantifiers

The difference between .*? , .* and [^"]*+ quantifiers

More articles:

The difference between `.?` , `.` and `[^"]*+` quantifiers