Can I improve the performance of this regular expression further

I am trying to find stream names from a stream dump file. Stream names are usually contained in double quotes on the first line of each stream dump. It might look so simple:

"THREAD1" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000] 

Or as much as possible:

 "[STANDBY] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007ff71803a000 nid=0xd3e7 in Object.wait() [0x00007ff7f8ae1000] 

The regular expression that I wrote is simple: "(.*)" . It captures everything inside double quotes as a group. However, this leads to a serious deviation, which requires many steps, as can be seen here . Verbally, we can explain this regex as “capture everything enclosed within double quotes as a group”

So, I came up with another regular expression that does the same thing: "([^\"])" . Verbally, we can describe this regular expression as “capture any number of characters without double quotes enclosed inside double quotes.” I did not find no quick regexp than this.It does not return anymore and therefore it requires minimal steps, as can be seen here .

I said it above to my colleague. He came up with another one: "(.*?)" . I did not understand how this works. It performs a significantly smaller rollback than the first, but slightly slower than the second, as can be seen here . but

  • I don’t understand why the rollback ends earlier.
  • I understand what ? - a quantifier that means once or not at all . However, I do not understand how once or not at all used here.
  • In fact, I cannot guess how we can describe this regular expression verbally.

My colleague tried to explain to me, but I still can not understand it completely. Can anyone explain?

+4
source share
1 answer

Brief Description and Solution

The regular expression "(.*)" Includes a lot of backtracking, because it finds the first " , and then captures the entire line and backtracking, looking for " which is closest to the end of the line. Since you have a substring with a quote closer to the beginning, there is more backtracking than with "(.*?)" , Because this lazy quantifier *? causes the regular expression engine to search for the closest " after the first one found " .

A solution with a negative character class "([^"]*)" is the best of 3, because it does not need to capture all, only all characters except " . However, to stop any digression and make the expression ultimately effective, you can use possessive quantifiers.

If you need to match strings like " + no quotes here + " , use

 "([^"]*+)" 

or even you don’t need to match the trailing quote in this situation:

 "([^"]*+) 

Watch the regex demo

In fact, I cannot guess how we can describe this regular expression verbally.

The last "([^"]*+) regular expression can be described as

  • " - find the first character " left of the line
  • ([^"]*+) - match and write to group 1 zero or more characters other than " as much as possible, and as soon as the engine finds a double quote, the match returns immediately, without backtracking.

Quantifiers

More on quantifiers from Rexegg.com :

A* Zero or more How, as much as possible (greedy), rejection of characters if the engine needs to be returned (docile)
A*? Zero or more As, as much as necessary to make the overall template fit (lazy)
A*+ Zero or more As, as much as possible (greedy), without giving up symbols if the engine tries to backtrack (possessive)

How do you see ? It is not a separate quantifier; it is part of another quantifier.

I advise you to learn more about why Lazy Quantifiers are expensive and that the Negated Class Solution really does manage its input line safely and quickly (where you just agree with a quote followed by no quotation marks and then the final quote).

The difference between .*? , .* and [^"]*+ quantifiers

  • The Greedy "(.*)" Solution works as follows: checks each character from left to right for a search " , and once found it captures the whole string to the end and checks each character if it is equal to " . Thus, in your input line it is accessed 160 times.

enter image description here

Since the next one " just around the corner, the number of reverse tracking steps is much less than with greedy matching.

enter image description here

  • An attractive quantification solution with a negative character class "([^"]*+)" works as follows: the engine finds the leftmost one " , and then captures all the characters that are not " until the first " . A negative character class [^"]*+ greedily matches zero or more characters that are not double quotes. Therefore, we guarantee that a star-point will never jump over the first one it encounters. " This is a more direct and efficient way of matching between some delimiters. Please note that in this solution we can fully trust * , which quantifies [^"] . Although it is greedy, there is no risk that [^"] will match too much because it is mutually exclusive with " . This is the principle of contrast from the regex style guide [see Source] .

Please note that the possessive quantifier does not allow the regular expression to reverse movement into a subexpression, after matching, the characters between " become one hard block that cannot be re-sorted" due to some "inconvenience" encountered by the regular expression engine, and it will not be able to shift any characters from and to this block of text.

For the current expression, this is not a big deal.

enter image description here

+7
source

Source: https://habr.com/ru/post/1260456/


All Articles