What does \x mean in PHP PCRE?

From the manual:

After \x, up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, \x{...} is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

And what does it mean?

The code point of "ä" is E4, and its UTF-8 representation is C3 A4, but neither of these matches:

 $t = 'ä'; // same as "\xC3\xA4"
 preg_match('/\\xC3A4/u', $t); // doesn't match
 preg_match('/\\x00E4/u', $t); // doesn't match

With curly braces, it matches when I specify the code point:

 preg_match('/\\x{00E4}/u', $t); // matches 
1 answer

The \x syntax is a way to specify a character by its value:

  • \xAB indicates a code point in the range 0-FF.
  • \x{ABCD} indicates a code point written with any number of hex digits, so it can go beyond FF (up to 10FFFF in UTF-8 mode).
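As a minimal sketch of the two forms, assuming the subject string is "ä" stored as UTF-8 (the variable names are only illustrative): under the /u modifier both escapes name the code point U+00E4, while without /u each \xhh names a single byte.

```php
<?php
// "ä" is U+00E4; its UTF-8 encoding is the two bytes C3 A4.
$t = "\xC3\xA4";

// UTF-8 mode (/u): both escape forms name the code point U+00E4.
$short  = preg_match('/^\\xE4$/u', $t);      // two-digit form, expected 1
$braced = preg_match('/^\\x{00E4}$/u', $t);  // braced form, expected 1

// Byte mode (no /u): each \xhh names one byte of the subject.
$bytes = preg_match('/^\\xC3\\xA4$/', $t);   // expected 1

// Byte mode: \xE4 is the single byte E4, which the UTF-8 string does not contain.
$latin1 = preg_match('/\\xE4/', $t);         // expected 0

var_dump($short, $braced, $bytes, $latin1);
```

The anchors ^ and $ are there only to show that the whole two-byte string is consumed as one character in UTF-8 mode.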

This wording from the manual is a bit confusing, perhaps in an attempt to be precise. Code points 128-255 (and beyond, up to 7FF) are encoded as two bytes in UTF-8. Thus, a Unicode regular expression will match 7-bit clean ASCII, but will not match input in other encodings or code pages (e.g. CP437) that use byte values in that range. The manual is saying, in a roundabout way, that a Unicode regular expression is only suitable for correctly encoded input.

However, this does not mean that \xABCD is parsed as \x{ABCD} (one character). It is parsed as \xAB (one character) followed by CD (two characters); see note 1. The braces address this parsing ambiguity:

After \x, up to two hexadecimal digits are read. In UTF-8 mode, \x{...} is allowed.
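The two-digit limit can be checked with a short sketch. It assumes PHP with the /u modifier; the variable names are hypothetical. The pattern \xC3A4 is read as \xC3 (the code point U+00C3, "Ã") followed by the literal text "A4", so it does not match "ä". It also illustrates the point about correctly encoded input: a subject that is not valid UTF-8 produces an error rather than a match.

```php
<?php
// \xC3A4 is read as \xC3 (one character, U+00C3 under /u) followed by "A4".
$pattern = '/^\\xC3A4$/u';

$aUmlaut = "\xC3\xA4";          // "ä" encoded as UTF-8
$literal = "\xC3\x83" . "A4";   // "Ã" (U+00C3) encoded as UTF-8, then "A4"

$r1 = preg_match($pattern, $aUmlaut); // 0: "ä" is not "Ã" followed by "A4"
$r2 = preg_match($pattern, $literal); // 1

// Under /u the subject must be valid UTF-8. A lone C3 byte followed by
// ASCII text is not, so preg_match() returns false instead of 0 or 1.
$r3 = preg_match($pattern, "\xC3" . "A4");
$badUtf8 = (preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($r1, $r2, $r3, $badUtf8);
```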

Some other languages use \u instead of \x for the longer form.


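1 Note that the following matches. The subject is "Ã" (U+00C3) encoded as UTF-8, followed by the literal text "A4"; under /u, a bare "\xC3" . "A4" would be rejected as invalid UTF-8:

preg_match('/\\xC3A4/u', "\xC3\x83" . "A4");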


Source: https://habr.com/ru/post/1499717/

