How to identify invalid regular expression characters in a long list of characters?

The goal is to port this Perl regular expression to Python:

$norm_text =~ s/(\P{N})(\p{P})/$1 $2 /g;
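
To make the target behavior concrete, here is a minimal sketch of the same substitution in Python. It uses small ASCII stand-ins rather than the full Unicode properties: `[^0-9]` stands in for `\P{N}` (a non-number) and `[,.!?]` for `\p{P}` (punctuation). The function name `pad_punct` is hypothetical and used only for illustration.

```python
import re

# ASCII-only stand-ins (an assumption for illustration):
#   [^0-9]   approximates \P{N}, i.e. any character that is NOT a number
#   [,.!?]   approximates \p{P}, i.e. a punctuation character
PAD_RX = re.compile(r'([^0-9])([,.!?])')

def pad_punct(text):
    """Mimic s/(\\P{N})(\\p{P})/$1 $2 /g with the simplified classes above."""
    return PAD_RX.sub(r'\1 \2 ', text)

# The comma after "Hello" is padded; punctuation that follows a digit is left alone.
print(pad_punct(u'Hello, 3,000!'))
```

Note that `\P{N}` is the negation of `\p{N}`, which is why the stand-in for the first group is a negated set.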

First, I copied the character lists for \p{P} and \P{N} into readable text files and loaded them:

import requests
from six import text_type

n_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Number.txt'
p_url = 'https://raw.githubusercontent.com/alvations/charguana/master/charguana/data/perluniprops/Punctuation.txt'

NUMS = text_type(requests.get(n_url).content.decode('utf8'))
PUNCTS = text_type(requests.get(p_url).content.decode('utf8'))

But when I tried to compile the regex:

re.compile(u'([{n}])([{p}])'.format(n=NUMS, p=PUNCTS))

It throws this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
    p = _parse_sub(source, state, sub_verbose)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ~-- at position 217 (line 1, column 218)

After looking into it, the cause seems to be that unescaped hyphens inside a character set are parsed as range operators, and some of them produce invalid ranges in a Python regex.

The problem appears to be in this run of hyphens:

>>> NUMS[215:352]
'~----------------------------------------------------------------------------------------------------------------------------------------'
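
The failure can be reproduced in isolation with a much shorter character set. The sketch below assumes nothing beyond the standard `re` module; the helper name `compile_error` is hypothetical. It shows that an unescaped hyphen between two characters is read as a range, that the range is rejected when its end point sorts before its start point, and that `re.error` exposes the message and position as `msg` and `pos`.

```python
import re

def compile_error(chars):
    """Return (msg, pos) if u'[chars]' fails to compile, otherwise None."""
    try:
        re.compile(u'[{}]'.format(chars))
    except re.error as exc:
        # msg is the human-readable reason, pos is the index in the pattern
        return exc.msg, exc.pos
    return None

print(compile_error(u'z-a'))    # reversed range: reported as a bad character range
print(compile_error(u'a-z'))    # valid range: None
print(compile_error(u'z\\-a'))  # escaped hyphen is a literal: None
```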

Then I moved the hyphens to the front of the string, but there were more bad characters:

>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> re.compile(u'([{n}])'.format(n=NUMS2))

[output]:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 233, in compile
    return _compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 856, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, False)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 763, in _parse
    p = _parse_sub(source, state, sub_verbose)
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 415, in _parse_sub
    itemsappend(_parse(source, state, verbose))
  File "/Users/alvas/anaconda3/lib/python3.6/sre_parse.py", line 552, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range ¬-- at position 502 (line 1, column 503)

So I moved more characters to the front:

>>> NUMS2 = re.escape(NUMS[215:352]) + NUMS[:215] + NUMS[352:]
>>> NUMS3 = re.escape(NUMS2[500:504]) + NUMS2[:500] + NUMS2[504:]
>>> re.compile(u'([{n}])'.format(n=NUMS3))

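Moving characters one at a time like this means finding each "bad" character by hand.

Is there a way to identify all the "bad" characters in such a long list?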


You need to escape the ^, -, ] and \ characters inside a character class:

NUMS = re.sub(r'[]^\\-]', r'\\\g<0>', NUMS)
PUNCTS = re.sub(r'[]^\\-]', r'\\\g<0>', PUNCTS)
rx = re.compile(u'([{n}])([{p}])'.format(n=NUMS, p=PUNCTS))

The r'[]^\\-]' pattern matches 1 char, either ], ^, \ or -, and the r'\\\g<0>' replacement re-inserts that char with a \ in front of it.
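
As a quick check of this approach, the sketch below applies the same substitution to a short sample string instead of the downloaded lists. The names `find_special`, `escape_for_set` and `sample` are hypothetical. The sample is an assumption chosen to contain all four special characters plus two ordinary ones.

```python
import re

# Characters that have a special meaning inside [...]
SPECIAL_IN_SET = u']^\\-'

def find_special(chars):
    """List (index, char) for every character that must be escaped in a set."""
    return [(i, c) for i, c in enumerate(chars) if c in SPECIAL_IN_SET]

def escape_for_set(chars):
    """Prefix each of ], ^, \\ and - with a backslash, as in the answer."""
    return re.sub(r'[]^\\-]', r'\\\g<0>', chars)

# Hypothetical sample: tilde, hyphen, caret, bracket, backslash, '1', superscript two
sample = u'~-^]\\1\u00b2'

print(find_special(sample))
rx = re.compile(u'[{}]'.format(escape_for_set(sample)))

# Every character of the sample is matched by the compiled set
print(all(rx.fullmatch(c) for c in sample))
```

Listing the positions with `find_special` answers the "which characters are bad" question directly, without compiling the pattern over and over.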


Source: https://habr.com/ru/post/1683658/

