Hyphen in a verbose regular expression comment causes an error

Question

What is wrong with the following code? I narrowed the problem down to the hyphen in the comment, but why should a hyphen in a comment cause an error?

    import re

    valid = re.compile(r'''[^
        \uFFFE\uFFFF   # non-characters
        ]''', re.VERBOSE)

    Traceback (most recent call last):
      File "valid.py", line 5, in <module>
        ]''', re.VERBOSE)
      File "/usr/local/lib/python3.3/re.py", line 214, in compile
        return _compile(pattern, flags)
      File "/usr/local/lib/python3.3/re.py", line 281, in _compile
        p = sre_compile.compile(pattern, flags)
      File "/usr/local/lib/python3.3/sre_compile.py", line 494, in compile
        p = sre_parse.parse(p, flags)
      File "/usr/local/lib/python3.3/sre_parse.py", line 748, in parse
        p = _parse_sub(source, pattern, 0)
      File "/usr/local/lib/python3.3/sre_parse.py", line 360, in _parse_sub
        itemsappend(_parse(source, state))
      File "/usr/local/lib/python3.3/sre_parse.py", line 506, in _parse
        raise error("bad character range")
    sre_constants.error: bad character range

The following snippet, identical except that the comment has no hyphen, raises no error:

    import re

    valid = re.compile(r'''[^
        \uFFFE\uFFFF   # non characters !! no errors
        ]''', re.VERBOSE)

Edit:

Adding to @nhahtdh's answer: string concatenation seems to be another reasonable way to comment character classes alongside verbose-style patterns:

    import re

    valid = re.compile(
        r'[^'
        r'\u0000-\u0008'    # C0 block first segment
        r'\u000B\u000C'     # allow TAB U+0009, LF U+000A, and CR U+000D
        r'\u000E-\u001F'    # rest of C0
        r'\u007F'           # disallow DEL U+007F
        r'\u0080-\u009F'    # All C1 block
        r']'                # don't forget this!
        r'''
        | [0-9]             # normal verbose style
        | [a-z]             # another term +++
        ''', re.VERBOSE)
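Why this works can be checked with a small sketch (a shortened class, assuming Python 3.3 or later; the name `pattern` is only illustrative). Python joins adjacent string literals before `re.compile` is called, so the Python comments between the pieces never reach the regex engine:

```python
import re

# Adjacent string literals are concatenated by Python itself;
# the comments below are Python comments, not regex comments.
pattern = (
    r'[^'
    r'\u0000-\u0008'   # C0 block, first segment
    r'\u000B\u000C'    # TAB, LF and CR stay allowed
    r'\u000E-\u001F'   # rest of C0
    r']'
)

valid = re.compile(pattern, re.VERBOSE)

print(pattern)                      # the joined pattern contains no '#'
print(bool(valid.match('\t')))      # TAB is allowed
print(bool(valid.match('\x00')))    # NUL is rejected
```

Since the comments live outside the string, hyphens or any other characters in them cannot affect the character class.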

+4

python regex

Basel shishani Sep 17 '13 at 4:37


2 answers

Comments do not always work inside regular expressions, and it looks like your regular expression engine is parsing the hyphen as part of the pattern. You cannot rely on the comment being ignored here. This is a good thing to learn before you put this code into use.

-1

fixedgod Sep 17 '13 at 4:48


nhahtdh · Accepted Answer · 2013-09-17T04:42:50+0000

According to the documentation (my emphasis):

re.X
re.VERBOSE
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

Basically, you cannot comment inside a character class, and a space inside a character class is considered significant.

Since the # is inside a character class, it does not start a comment. Everything inside the character class is parsed as part of the class, without exception (even the newline character becomes a member of the class). The error is therefore caused by the invalid character range n-c, taken from "non-characters": n comes after c, so the range is reversed.
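A minimal sketch to confirm this, assuming Python 3.3 or later (the names `broken` and `quiet` are illustrative and not from the question). The hyphenated version fails to compile, while the version without the hyphen compiles but silently treats the comment text as members of the class:

```python
import re

# With the hyphen, "n-c" inside the class is read as a reversed range.
broken = r'''[^
    \uFFFE\uFFFF   # non-characters
    ]'''

try:
    re.compile(broken, re.VERBOSE)
    compiled_ok = True
    message = ''
except re.error as exc:
    compiled_ok = False
    message = str(exc)

# Without the hyphen the pattern compiles, but '#', the space, the newline
# and the letters of the "comment" are all part of the negated class.
quiet = re.compile(r'''[^
    \uFFFE\uFFFF   # non characters
    ]''', re.VERBOSE)

hash_excluded = quiet.match('#') is None     # '#' is a class member
letter_excluded = quiet.match('n') is None   # so is the letter 'n'
other_allowed = quiet.match('x') is not None # 'x' is not in the class

print(compiled_ok, message)
print(hash_excluded, letter_excluded, other_allowed)
```

So the hyphen-free version in the question is not actually correct either: it merely avoids the error while rejecting characters such as `#` and `n` that were meant to be allowed.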

A valid way to write an expression would be:

 valid = re.compile(r'[^\uFFFE\uFFFF] # non-characters', re.VERBOSE)

Here is one suggestion on how to comment when you want to explain a long class of characters:

    r'''
    # LOTS is for foo
    # _ is a special fiz
    # OF-LITERAL is for bar
    [^LOTS_OF-LITERAL]
    '''
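Applied to the class from the question, that layout might look like the following sketch (assuming Python 3.3 or later). The comment sits on its own line before the class, where the verbose flag does treat `#` as a comment marker, so the hyphen in it is harmless:

```python
import re

valid = re.compile(r'''
    # U+FFFE and U+FFFF are non-characters
    [^\uFFFE\uFFFF]
    ''', re.VERBOSE)

print(bool(valid.match('a')))        # ordinary character is accepted
print(bool(valid.match('-')))        # a hyphen is accepted too
print(bool(valid.match('\uFFFE')))   # the non-character is rejected
```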
