Empty string instead of unmatched group error

Question

I have this piece of code:

    for n in range(1, 10):
        new = re.sub(r'(regex(group)regex)?regex', r'something'+str(n)+r'\1', old, count=1)

It raises an "unmatched group" error. When the group is unmatched, I want an empty string inserted there instead of an error being raised. How can I achieve this?

Note: my full code is much more complicated than this example. If you know a better way to iterate over the matches and add a number inside each one, feel free to share it. My full code is:

    for n in range(1, text.count('soutez') + 1):
        text = re.sub(
            r'(?i)(\s*\{{2}infobox medaile reprezentant(ka)?\s*\|\s*([^\}]*)\s*\}{2}\s*)?\{{2}infobox medaile soutez\s*\|\s*([^\}]*)\s*\}{2}\s*',
            r"\n | reprezentace"+str(n)+r" = \3\n | soutez"+str(n)+r" = \4\n | medaile"+str(n)+r" = \n",
            text, count=1)

python regex python-2.x

aleskva Feb 19 '16 at 10:31

3 answers

I've looked at this again.
Note that it is unfortunate that you have to deal with NULL at all,
but here are the rules you have to follow.

All of the patterns below match successfully, and all of them match nothing (the empty string).
You have to test it this way to find out the rules.

It is not as simple as you might think. Take a close look at the results. There is no obvious, reliable way to tell from the form of the pattern alone whether you will get NULL or EMPTY.

However, looking closer, the rules emerge and are fairly simple.
These rules have to be followed if you care about NULL.

There are only two rules:

Rule No. 1 - Any GROUP that cannot be reached will result in NULL

    (?<Alt_1>             # (1 start)
       (?<a> a )?         # (2)
       (?<b> b? )         # (3)
    )?                    # (1 end)
    |
    (?<Alt_2>             # (4 start)
       (?<c> c? )         # (5)
       (?<d> d? )         # (6)
    )                     # (4 end)

    **  Grp 0         -  ( pos 0 , len 0 )  EMPTY
    **  Grp 1 [Alt_1] -  ( pos 0 , len 0 )  EMPTY
    **  Grp 2 [a]     -  NULL
    **  Grp 3 [b]     -  ( pos 0 , len 0 )  EMPTY
    **  Grp 4 [Alt_2] -  NULL
    **  Grp 5 [c]     -  NULL

Rule No. 2 - Any GROUP that cannot be matched on the INSIDE will result in NULL

    (?<A_1>               # (1 start)
       (?<a1> a? )        # (2)
    )?                    # (1 end)
    (?<A_2>               # (3 start)
       (?<a2> a )?        # (4)
    )?                    # (3 end)
    (?<A_3>               # (5 start)
       (?<a3> a )         # (6)
    )?                    # (5 end)
    (?<A_4>               # (7 start)
       (?<a4> a )?        # (8)
    )                     # (7 end)

    **  Grp 0       -  ( pos 0 , len 0 )  EMPTY
    **  Grp 1 [A_1] -  ( pos 0 , len 0 )  EMPTY
    **  Grp 2 [a1]  -  ( pos 0 , len 0 )  EMPTY
    **  Grp 3 [A_2] -  ( pos 0 , len 0 )  EMPTY
    **  Grp 4 [a2]  -  NULL
    **  Grp 5 [A_3] -  NULL
    **  Grp 6 [a3]  -  NULL
    **  Grp 7 [A_4] -  ( pos 0 , len 0 )  EMPTY
    **  Grp 8 [a4]  -  NULL
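The patterns above use the (?<name>...) syntax and the NULL/EMPTY terminology of a different regex engine. As a rough Python counterpart (a minimal sketch with simplified patterns, not the exact ones from this answer), NULL corresponds to None and EMPTY to '' in the match object. The helper name group_states is made up for illustration.

```python
import re

def group_states(pattern, text=''):
    """Return {group name: value}; None means the group did not participate."""
    match = re.match(pattern, text)
    return match.groupdict()

# Rule 1: a group inside an alternative that is never reached stays None.
print(group_states(r'(?P<x>a?)|(?P<y>b?)'))

# Rule 2: a group whose inside could not match stays None,
# while the enclosing mandatory group that matched nothing is ''.
print(group_states(r'(?P<outer>(?P<inner>a)?)'))
```

In the first pattern x is '' and y is None; in the second, outer is '' and inner is None.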

sln Feb 19 '16 at 23:39

To simplify:

Problem

You get the error "sre_constants.error: unmatched group" from a Python 2.7 regular expression.
You have a regular expression pattern with optional groups (with or without nested expressions) and are trying to use those groups in the replacement argument of a substitution ( re.sub(pattern, *repl*, string) or compiled.sub(*repl*, string) ).

Solution

In the replacement, return match.group(1) instead of using \1 (or \2, \3, etc.). That's it; no "or ''" fallback is needed. The group result can be returned from a function or a lambda.

Example

You are using a common regex to strip C-style comments. Its design uses an optional group 1 to pass through pseudo-comments that should not be deleted (if they exist), such as comment-like text inside string literals.

    pattern = r'//.*|/\*[\s\S]*?\*/|("(\\.|[^"])*")'
    regex = re.compile(pattern)

Using \1 fails with the error: "sre_constants.error: unmatched group":

 return regex.sub(r'\1', string)

Using .group(1) succeeds:

 return regex.sub(lambda m: m.group(1), string)

For those unfamiliar with lambda, this solution is equivalent to:

    def optgroup(match):
        return match.group(1)

    return regex.sub(optgroup, string)
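Put together, the example above might look like the following self-contained sketch. The function name strip_comments and the sample input are assumptions added for illustration; the pattern is the one from this answer.

```python
import re

# Group 1 captures double-quoted strings so that comment-like text
# inside them is passed through unchanged.
COMMENT_OR_STRING = re.compile(r'//.*|/\*[\s\S]*?\*/|("(\\.|[^"])*")')

def strip_comments(source):
    # For a comment, group(1) is None and nothing is inserted;
    # for a string literal, group(1) is the literal itself.
    return COMMENT_OR_STRING.sub(lambda m: m.group(1), source)

print(strip_comments('x = "a // b"; // trailing comment'))
```

Returning None works here only because the group is the entire replacement. If the group is concatenated with other text, a fallback such as m.group(1) or '' is still needed to avoid a TypeError.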

See the accepted answer for a great discussion of why \1 fails because of bug 1519638. Although the accepted answer is authoritative, it has two drawbacks: 1) the example from the original question is so convoluted that it makes the example solution hard to read, and 2) it suggests returning the group or an empty string, which is not required: you can simply call .group() on each match.

Jeremydouglass Jan 15 '18 at 0:58

Wiktor Stribiżew · Accepted Answer · 2016-02-19T22:41:17+0000

Root cause

Before Python 3.5, backreferences to failed capture groups in Python re.sub were not populated with an empty string. See the description of Bug 1519638 at bugs.python.org. Thus, using a backreference to a group that did not participate in the match resulted in an error.
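A small check can show which behavior the running interpreter has. This is only a sketch: the helper name try_sub is hypothetical, and it relies on re.error being the same exception class that is reported as sre_constants.error.

```python
import re

def try_sub(pattern, repl, string):
    """Return the substituted string, or None if re.sub raises re.error."""
    try:
        return re.sub(pattern, repl, string)
    except re.error:
        return None

# Group 1 does not participate in this match.
result = try_sub(r'regex(group)?regex', r'something\1something', 'regexregex')
print(result)  # None before Python 3.5, 'somethingsomething' from 3.5 on
```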

There are two ways to fix this problem.

Solution 1: Add empty alternatives to make optional groups mandatory

You can replace all optional capturing groups (constructs like (\d+)? ) with mandatory ones that have an empty alternative (i.e. (\d+|) ).

Here is an example of the failure:

    import re
    old = 'regexregex'
    new = re.sub(r'regex(group)?regex', r'something\1something', old)
    print(new)

Replacing one line with

    new = re.sub(r'regex(group|)regex', r'something\1something', old)

makes it work.

Solution 2: Use a lambda expression in the replacement and check if the group is not `None`

This approach is necessary if you have optional groups inside another optional group.

You can use a lambda in the replacement part to check whether the group participated in the match, i.e. is not None, with lambda m: m.group(n) or ''. Use this solution in your case, because you have two backreferences, #3 and #4, in the replacement pattern, but some matches (see Match 1 and 3) do not have capture group 3 initialized. This happens because the whole first part, (\s*\{{2}funcA(ka|)\s*\|\s*([^}]*)\s*\}{2}\s*|), does not participate in the match, so the inner capture group 3 (i.e. ([^}]*) ) is not populated even after adding an empty alternative.

 re.sub(r'(?i)(\s*\{{2}funcA(ka|)\s*\|\s*([^\}]*)\s*\}{2}\s*|)\{{2}funcB\s*\|\s*([^\}]*)\s*\}{2}\s*', r"\n | funcA"+str(n)+r" = \3\n | funcB"+str(n)+r" = \4\n | string"+str(n)+r" = \n", text, count=1)

should be rewritten with

    re.sub(r'(?i)(\s*{{funcA(ka|)\s*\|\s*([^}]*)\s*}}\s*|){{funcB\s*\|\s*([^}]*)\s*}}\s*',
           # A callable's return value is inserted literally, so use "\n", not r"\n".
           lambda m: "\n | funcA"+str(n)+" = " + (m.group(3) or '') + "\n | funcB" + str(n) + " = " + (m.group(4) or '') + "\n | string" + str(n) + " = \n",
           text, count=1)

See the IDEONE demo

    import re

    # Line breaks are written as explicit \n escapes so the whitespace is unambiguous.
    text = (
        "\n{{funcB|param1}}\n*some string*\n"
        "{{funcA|param2}}\n{{funcB|param3}}\n*some string2*\n"
        "{{funcB|param4}}\n*some string3*\n"
        "{{funcAka|param5}}\n{{funcB|param6}}\n*some string4*\n"
    )

    for n in range(1, text.count('funcB') + 1):
        text = re.sub(
            r'(?i)(\s*\{{2}funcA(ka|)\s*\|\s*([^\}]*)\s*\}{2}\s*|)\{{2}funcB\s*\|\s*([^\}]*)\s*\}{2}\s*',
            # A callable's return value is inserted literally, so use "\n", not r"\n".
            lambda m: "\n | funcA"+str(n)+" = "+(m.group(3) or '')+"\n | funcB"+str(n)+" = "+(m.group(4) or '')+"\n | string"+str(n)+" = \n",
            text, count=1)

    assert text == (
        "\n"
        "\n | funcA1 = \n | funcB1 = param1\n | string1 = \n*some string*"
        "\n | funcA2 = param2\n | funcB2 = param3\n | string2 = \n*some string2*\n"
        "\n | funcA3 = \n | funcB3 = param4\n | string3 = \n*some string3*"
        "\n | funcA4 = param5\n | funcB4 = param6\n | string4 = \n*some string4*\n"
    )
    print('ok')
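The question also asked for a better way to number the matches. One possible alternative, sketched here on a deliberately simplified pattern rather than the infobox regex, is a single re.sub pass whose replacement function keeps its own counter, so the loop with count=1 is not needed. match.groups(default='') returns '' for groups that did not participate. The names PATTERN and replace_numbered and the key/value output format are assumptions for illustration.

```python
import re

# Group 1 (a "name:" prefix) is optional; group 2 is always present.
PATTERN = re.compile(r'(?:(\w+):)?(\d+)')

def replace_numbered(text):
    counter = [0]  # mutable cell so the nested function can update it on Python 2 and 3

    def repl(match):
        counter[0] += 1
        # default='' turns groups that did not participate into empty strings
        key, value = match.groups(default='')
        return 'key{0}={1} value{0}={2}'.format(counter[0], key, value)

    return PATTERN.sub(repl, text)

print(replace_numbered('a:1 2'))
```

Each match is visited once, left to right, so the counter value plays the role of n in the original loop.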
