Find capture groups in user regex

I have a python application that should handle custom regular expressions. For performance reasons, I want to disable capture groups and backlinks.

My idea is to use another regular expression to make sure the user-defined regular expression does not contain any names or unnamed groups, such as:

def validate_user_regex(pattern): if re.match('[^\\\]\((?:\?P).*?[^\\\]\)', pattern) is not None: return False return True 

Although I think my idea might work for capturing groups, I'm not sure if this will prevent all kinds of backlinks. So are there any smarter ways to prevent capturing groups and backlinks in regular expressions?

+2
source share
2 answers

The regular expression language is not a regular language, so it cannot be reliably divided into significant parts using regular expressions (see OpenEx Open tags, with the exception of stand-alone XHTML tags for the same case for HTML).

Why not use your own Python parser for this?

 >>> r="whate(ever)(?:\\1)" >>> import sre_parse #the module used by `re' internally for regex parsing >>> sre_parse.parse(r) [('literal', 119), ('literal', 104), ('literal', 97), ('literal', 116), ('literal', 101), ('subpattern', (1, [('literal', 101), ('literal', 118), ('lit eral', 101), ('literal', 114)])), ('subpattern', (None, [('groupref', 1)]))] 

As you can see, this is a parse tree, and you are interested in subpattern nodes with non None in the first element and groupref .

+2
source

Your regular expression is very naive. (In fact, it was difficult for me to find a group design that matches your regular expression.)

To avoid false positives, for example, [(bar)] , it is necessary to analyze / compare the entire pattern from left to right. I came up with this regex:

 ^(?:\[(?:\\.|[^\]])*\]|\(\?:|[^(\\]|\\\D|\(\?[gmixsu])*$ 

regex101 demo.


Explanation:

 ^ # start of string anchor (?: # this group matches a single valid expression: # character classes: \[ # the opening [ (?: \\. # any escaped character | # or [^\]] # anything that not a closing ] )* # any number of times \] # the closing ] | # non-capturing groups: \(\?: # (?: literally | # normal characters (anything that not a backslash or ( [^(\\] | # meta sequences like \s \\\D | # inline modifiers like (?i) \(\?[gmixsu] )* # any number of valid expressions. $ # end of string anchor 

PS: This regular expression does not guarantee that the pattern is valid. (Compiling the template may still fail.)

0
source

Source: https://habr.com/ru/post/1493296/


All Articles