Is there any way to treat .* as .{0,1024} in a Perl RE?

We allow some user-supplied REs for email filtering. In the past we ran into performance problems with REs that contained, for example, .* when matched against arbitrarily large email messages. We found a simple fix: apply s/\*/{0,1024}/ to the user-supplied RE. However, this is not a perfect solution, because it breaks on a pattern such as:

 /[*]/ 
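
To make the failure concrete, here is a minimal sketch of the blind replacement. It is written in Python purely for illustration (the thread itself is about Perl), and the helper name naive_cap is hypothetical. It shows that the rewritten character class no longer matches a literal asterisk.

```python
import re

def naive_cap(pattern: str) -> str:
    """Blindly replace every '*' with '{0,1024}', in the spirit of s/\\*/{0,1024}/."""
    return pattern.replace("*", "{0,1024}")

print(naive_cap("a.*b"))  # a.{0,1024}b
print(naive_cap("[*]"))   # [{0,1024}]

# The class [*] used to match a literal '*'. After the rewrite it is a class
# containing '{', '0', ',', '1', '2', '4' and '}' instead.
broken = naive_cap("[*]")
print(re.fullmatch(broken, "*") is None)      # True: '*' no longer matches
print(re.fullmatch(broken, "0") is not None)  # True: '0' now matches
```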

Rather than come up with some convoluted recipe to account for every possible variation of a user-supplied RE, I would like to simply limit Perl's interpretation of the * and + quantifiers to a maximum length of 1024 characters.

Is there any way to do this?

+6
4 answers

Update

I added (?<!\\) before the quantifiers, because an escaped * or + must not be replaced. The substitution will still fail on \\* (match a backslash 0 or more times).

An improved version would be:

 s/(?<!\\)\*(?!(?<!\\)[^[]*?(?<!\\)\])/{0,1024}/
 s/(?<!\\)\+(?!(?<!\\)[^[]*?(?<!\\)\])/{1,1024}/

See here at Regexr

This means: match * or +, but only if there is no closing ] ahead with no [ before it (in other words, only if the quantifier is not inside a character class). The (?<!\\) parts make sure the square brackets are not preceded by a \.

(?! ... ) is a negative lookahead

(?<! ... ) is a negative lookbehind

See perlretut for more details.
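
As a quick way to exercise these two patterns, here is a sketch that ports them to Python's re module. Python is an assumption made for illustration only; the answer itself uses Perl's s///. The helper name cap_quantifiers is hypothetical, and [^[] is written [^\[] just to avoid a Python warning about nested sets. Note that re.sub replaces every occurrence, which in Perl would need the /g flag.

```python
import re

# Negative lookahead taken from the answer: reject the quantifier if an
# unescaped ']' follows with no '[' in between (we are inside a class).
_IN_CLASS = r"(?!(?<!\\)[^\[]*?(?<!\\)\])"
STAR = re.compile(r"(?<!\\)\*" + _IN_CLASS)
PLUS = re.compile(r"(?<!\\)\+" + _IN_CLASS)

def cap_quantifiers(pattern: str, limit: int = 1024) -> str:
    """Replace unescaped '*' and '+' outside character classes."""
    pattern = STAR.sub("{0,%d}" % limit, pattern)
    return PLUS.sub("{1,%d}" % limit, pattern)

print(cap_quantifiers("a.*b"))   # a.{0,1024}b
print(cap_quantifiers("x+y"))    # x{1,1024}y
print(cap_quantifiers("[*+]"))   # [*+]   (left alone inside the class)
print(cap_quantifiers(r"\*"))    # \*     (escaped, left alone)
print(cap_quantifiers(r"\\*"))   # \\*    (the known failure: not replaced)
```

The last line reproduces the limitation mentioned above: a single lookbehind cannot tell an escaped backslash followed by * from an escaped *.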

Update 2: handle possessive quantifiers

 s/(?<!(?<!\\)[\\+*?])\+(?!(?<!\\)[^[]*?(?<!\\)\])/{1,1024}/ # for +
 s/(?<!\\)\*(?!(?<!\\)[^[]*?(?<!\\)\])/{0,1024}/ # for *

See here at Regexr

It seems to work, but now it is getting really complicated!

+4

This does not answer your question, but you should be aware of other problems with user-supplied regular expressions; see for example this summary at OWASP. Depending on your exact situation, it might be better to write or find a simple custom pattern-matching library.

+5

Get the parse tree using Regexp::Parser and modify the regular expression however you want, or provide a GUI interface to Regexp::English.
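
The idea of rewriting from structure rather than from raw text can be sketched without a full parser. The following is a hand-rolled scanner in Python (an assumption for illustration; it is not Regexp::Parser and does not use its API). The function name cap_unbounded is hypothetical. It tracks escapes and bracket classes while copying the pattern, and it deliberately ignores \Q...\E, (?#...) comments and /x mode.

```python
def cap_unbounded(pattern: str, limit: int = 1024) -> str:
    """Rewrite unbounded '*' and '+' as '{0,limit}' and '{1,limit}'."""
    out = []
    i, n = 0, len(pattern)
    in_class = False     # currently inside [...]
    after_quant = False  # previous token was a quantifier
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            # An escaped pair such as \* or \\ is copied verbatim.
            out.append(pattern[i:i + 2])
            i += 2
            after_quant = False
            continue
        if in_class:
            if pattern.startswith("[:", i) and pattern.find(":]", i + 2) != -1:
                # POSIX class such as [:alpha:] nested in the bracket class.
                end = pattern.find(":]", i + 2) + 2
                out.append(pattern[i:end])
                i = end
                continue
            if ch == "]":
                in_class = False
            out.append(ch)
            i += 1
            continue
        if ch == "[":
            in_class = True
            out.append(ch)
            i += 1
            # A ']' directly after '[' or '[^' is a literal, not the end.
            if i < n and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < n and pattern[i] == "]":
                out.append("]")
                i += 1
            after_quant = False
            continue
        if ch == "+" and after_quant:
            # Possessive modifier (a*+, a++): keep it as is.
            out.append(ch)
            after_quant = False
        elif ch == "*":
            out.append("{0,%d}" % limit)
            after_quant = True
        elif ch == "+":
            out.append("{1,%d}" % limit)
            after_quant = True
        else:
            out.append(ch)
            after_quant = ch in "?}"
        i += 1
    return "".join(out)

print(cap_unbounded("a.*b"))    # a.{0,1024}b
print(cap_unbounded("[*]"))     # [*]
print(cap_unbounded(r"\\*"))    # \\{0,1024}
print(cap_unbounded("a++"))     # a{1,1024}+
```

Because escapes are consumed in pairs, the \\* case that defeats a single lookbehind is handled here. A real parser would still be the safer choice for anything beyond these basic constructs.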

+4

You mean, besides patching the source?

  • You can split the input text into shorter pieces and match only those. But then you will not match across the "line" break.
  • You can split the regexp, search for only its first char, load the next 1024 characters of text, and then match the whole regexp against that (obviously this does not work with a regexp that starts with a dot).
  • Find the first char in the regexp that is not one of . * + ( ) \, find that char in the text, load 1024 characters before and after it, and then match the entire regexp against that string. (Complex, and prone to errors on weird unexpected regexps.)
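
The first option in the list above can be sketched with overlapping windows, which softens the boundary problem without removing it. This is a Python sketch for illustration only; the function name search_in_windows and the window and overlap defaults are assumptions, not something from the thread. A match longer than the window is never found, which is effectively the length cap the question asks for, but a match found this way may differ from the one a search over the full text would return.

```python
import re

def search_in_windows(pattern, text, window=1024, overlap=512):
    """Search text one bounded slice at a time; return (start, end) or None."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    rx = re.compile(pattern)
    step = window - overlap
    for start in range(0, max(len(text), 1), step):
        # pos/endpos restrict the search to the slice, but the reported
        # offsets are relative to the whole text.
        m = rx.search(text, start, start + window)
        if m:
            return (m.start(), m.end())
    return None

text = "x" * 2000 + "foo123bar" + "y" * 2000
print(search_in_windows(r"foo\d+bar", text))              # (2000, 2009)

# A match that would span more than one window is missed on purpose.
print(search_in_windows("a.*b", "a" + "x" * 3000 + "b"))  # None
```
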
+1

Source: https://habr.com/ru/post/903826/

