Word boundary with words beginning or ending with special characters gives unexpected results.

Let's say I want to match the presence of the phrase Sortes\index[persons]{Sortes} in the phrase test Sortes\index[persons]{Sortes} text.

Using Python's re, I could do this:

>>> import re
>>> search = re.escape(r'Sortes\index[persons]{Sortes}')
>>> match = r'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

This works, but I want to prevent the search pattern Sortes from giving a positive result on the phrase test Sortes\index[persons]{Sortes} text.

>>> re.search(re.escape('Sortes'), match)
<_sre.SRE_Match object; span=(5, 11), match='Sortes'>

Therefore, I use the \b pattern, like this:

>>> search = r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'\b'
>>> match = r'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)

Now I do not get a match.

If the search pattern does not contain any of the characters []{}, it works. For instance:

>>> re.search(r'\b' + re.escape(r'Sortes\index') + r'\b', r'test Sortes\index test')
<_sre.SRE_Match object; span=(5, 17), match='Sortes\\index'>

Also, if I remove the final r'\b', it works:

>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}'), r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>

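In addition, the documentation says this about \b:

Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.

So I tried replacing the final \b with (\W|$):

>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'(\W|$)', r'test Sortes\index[persons]{Sortes} test')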
<_sre.SRE_Match object; span=(5, 35), match='Sortes\\index[persons]{Sortes} '>

And now it works! What is going on here? What am I missing?


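First, look at what a word boundary matches:

A word boundary can occur in one of three positions:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.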
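To see where \b can match, here is a minimal sketch that lists every boundary position in a string. The helper name boundary_positions is made up for this illustration and is not part of the re module.

```python
import re

def boundary_positions(text):
    # Hypothetical helper: collect each index where the zero-width
    # word-boundary assertion succeeds.
    return [m.start() for m in re.finditer(r'\b', text)]

# Start and end count because 'a' and 'd' are word characters.
print(boundary_positions('ab cd'))  # [0, 2, 3, 5]
# No boundary at index 0 or 4: '{' and '}' are not word characters.
print(boundary_positions('{ab}'))   # [1, 3]
# Only non-word characters, so there is no boundary anywhere.
print(boundary_positions('{}'))     # []
```

Note that the string edges are boundaries only when the adjacent character is a word character.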

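In your pattern, }\b only matches if there is a word char (a letter, a digit, or _) right after the }.

When you use (\W|$), you explicitly require a non-word character or the end of the string.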
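A quick way to check both claims, as a sketch on shortened strings (the variable names are only for illustration):

```python
import re

# '}' followed by a space: both are non-word characters, so \b fails.
no_match = re.search(r'\}\b', '{Sortes} text')
print(no_match)  # None

# '}' followed by a letter: non-word then word, so \b succeeds.
with_letter = re.search(r'\}\b', '{Sortes}text')
print(with_letter.span())  # (7, 8)

# (\W|$) succeeds, but it consumes the character after '}'.
consumed = re.search(r'\}(\W|$)', '{Sortes} text')
print(repr(consumed.group()))  # '} '
```

The last result is why the span in the question grows from (5, 34) to (5, 35): the trailing space becomes part of the match.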

To avoid this, use unambiguous word boundaries based on negative lookarounds:

re.search(r'(?<!\w){}(?!\w)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')

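Here, the (?<!\w) negative lookbehind fails the match if there is a word char immediately to the left of the current position, and the (?!\w) negative lookahead fails the match if there is a word char immediately to the right of the current position.

Note that you can customize these boundaries (for example, if only letters should count, use [^\W\d_] instead of \w), or use whitespace boundaries, (?<!\S)/(?!\S).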
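Both kinds of boundary can be wrapped in one small function. This is only a sketch: find_phrase and its whitespace_only flag are names invented here, not an existing API.

```python
import re

def find_phrase(phrase, text, whitespace_only=False):
    # Hypothetical helper: search for a literal phrase that is not
    # glued to neighbouring characters.
    if whitespace_only:
        # Neighbours must be whitespace or a string edge.
        left, right = r'(?<!\S)', r'(?!\S)'
    else:
        # Neighbours must not be word characters.
        left, right = r'(?<!\w)', r'(?!\w)'
    return re.search(left + re.escape(phrase) + right, text)

phrase = r'Sortes\index[persons]{Sortes}'
text = 'test ' + phrase + ' text'

print(find_phrase(phrase, text).span())                    # (5, 34)
print(find_phrase('Sortes', text).span())                  # (5, 11)
print(find_phrase('Sortes', text, whitespace_only=True))   # None
```

With the \w-based lookarounds, a bare Sortes still matches, because the backslash after it is not a word character. If Sortes must not match inside the longer phrase, the whitespace boundaries are the variant that rules it out.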


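I think this is what you are running into:

\b is a boundary between \w and \W, and here there is none. '{Sortes}\b' asks for a boundary between \W and \W, because '}' is not in [a-zA-Z0-9_] and is therefore not \w.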


Source: https://habr.com/ru/post/1681613/

