Regex \R not working inside character class

In PHP, the \R escape sequence, which should match any newline sequence, does not work inside a character class.

I recently found out about this escape sequence in another answer here on Stack Overflow, and to be honest, I couldn't find much online documenting its existence: it is not mentioned anywhere on php.net except in a user comment.

Question(s):

  • Why doesn't \R work inside a character class?
  • Where is it documented?

EXAMPLE 1: ( https://regex101.com/r/vA8xV3/3 )

 $a = "line1\nline2";
 echo preg_replace('/\R/', ' ', $a);

Returns (finds a match, replaces with a single space):

 line1 line2 

EXAMPLE 2: ( https://regex101.com/r/vA8xV3/2 )

 $a = "line1\nline2";
 echo preg_replace('/[\R]/', ' ', $a);

Returns (no match):

 line1
 line2
+6
4 answers

From the PCRE manual:

Escape sequences in character classes

All the sequences that define a single character value can be used both inside and outside character classes. In addition, inside a character class, \b is interpreted as the backspace character (hex 08).

\N is not allowed in a character class. \B, \R, and \X are not special inside a character class. Like other unrecognized escape sequences, they are treated as the literal characters "B", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is set. Outside a character class, these sequences have different meanings.

(emphasis on the relevant part added by me)
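A minimal sketch of what this means for the pattern from the question. It assumes that the outcome depends on the PCRE library bundled with your PHP build: some builds treat [\R] as a class containing a literal "R", others reject the pattern at compile time. In neither case is the newline replaced.

```php
<?php
$input = "line1\nline2";

// @ suppresses the warning that some builds raise for an invalid escape in a class.
$result = @preg_replace('/[\R]/', ' ', $input);

if ($result === null) {
    // The pattern did not compile on this build.
    echo "pattern rejected\n";
} else {
    // The class matched only a literal "R", so the newline is still there.
    echo ($result === $input) ? "input unchanged\n" : "input changed\n";
}
```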

+5

This is the correct behavior. \R only works outside a character class. (At least this is true in grep and many other tools.)

For grep:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html

PHP uses Perl-compatible regular expressions, so see perldoc:

http://perldoc.perl.org/perlrebackslash.html#Misc

Since \R can match a sequence of more than one character, it cannot be put inside a bracketed character class; /[\R]/ is an error; use \v instead.
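The \v suggestion can be sketched as follows. The input strings and variable names are made up for illustration. Because \v matches one vertical-whitespace character at a time, a + quantifier is used here so that a CR LF pair produces a single replacement; when \R has to be combined with other alternatives, a non-capturing group can stand in for the character class.

```php
<?php
$input = "line1\r\nline2\nline3";

// \v stands for a single vertical-whitespace character, so it is legal inside [...].
// The + collapses "\r\n" into one replacement instead of two.
$joined = preg_replace('/[\v]+/', ' ', $input);
echo $joined, "\n";

// To mix \R with other alternatives, use a group rather than a class.
$parts = preg_split('/(?:\R|;)/', "a;b\r\nc");
print_r($parts);
```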

+3

As for why \R is not allowed inside a character class while \d, \s, \w, ... are: \R can match CR LF (\r\n), which consists of 2 code points. \X is not allowed inside a character class for the same reason, since it matches a grapheme cluster, which can contain multiple code points.

A character class is assumed to match exactly one code unit / code point, which makes it a deterministic construct in the sense that it does not require backtracking. Allowing a sequence of code points / code units to match inside a character class would give the class a variable length and complicate the minimum length / maximum length analysis that several optimizations rely on. It would also require a change in the matching semantics. For example, given [\r\n\R], should it match \r\n in the string "\r\n", or should it follow the declared order and match only \r? In case of failure, would backtracking be allowed?

I am not sure about the PCRE implementation. In Java, however, length analysis is used to optimize repetition constructs (for example, when repeating a fixed-length construct, there is no need to store the number of characters matched in each repetition for backtracking), to fail early when the input string does not satisfy the minimum length requirement, and to determine whether an expression is allowed in a look-behind.
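A small sketch of the difference, using a made-up input string: \R consumes CR LF as one unit, while a character class can only ever consume one character per match.

```php
<?php
$input = "a\r\nb\nc";

// \R treats "\r\n" as a single newline sequence: matches "\r\n" and "\n".
$asUnit = preg_match_all('/\R/', $input);

// A class matches one character at a time: "\r", "\n" and "\n".
$perChar = preg_match_all('/[\r\n]/', $input);

echo "$asUnit $perChar\n";
```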
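The ordering question can be illustrated with explicit alternation, where declared order is already significant. This is only an analogy with a made-up input, not a description of how any engine would treat a hypothetical multi-character class.

```php
<?php
$input = "\r\n";

// Alternatives are tried left to right: "\r" matches first, then "\n" separately.
$shortFirst = preg_match_all('/\r|\n|\r\n/', $input);

// With the two-character alternative first, the pair is consumed as one match.
$longFirst = preg_match_all('/\r\n|\r|\n/', $input);

echo "$shortFirst $longFirst\n";
```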

+1

I think I understand your question. A character class matches exactly what is listed inside the [], so in your case [\R] will match a \ and an R. For example, in the string balhblahRajndsf\ it would match the \ and the R. Does that make sense?

http://www.zytrax.com/tech/web/regex.htm

See "Brackets, Ranges and Negation" in the link above.

0

Source: https://habr.com/ru/post/986748/

