The syntax is a way to specify a character by its value:
\xAB indicates a code point in the range 0-FF.
\x{ABCD} indicates a code point in the range 0-FFFF.
This wording from the manual is a bit confusing, perhaps in an attempt to be precise. Character values in the range 128-255 (and some beyond) are encoded as 2 bytes in UTF-8. Thus, a Unicode regular expression will match 7-bit clean ASCII, but will not match input in other encodings or code pages (e.g. CP437) that use values in that range. The manual is saying, in a roundabout way, that a Unicode regular expression is only suitable for use with correctly encoded input.
However, this does not mean that \xABCD is parsed as \x{ABCD} (one character). It is parsed as \xAB (one character) followed by CD (two characters) 1. The braces resolve this parsing ambiguity:
After \x, up to two hexadecimal digits are read. In UTF-8 mode, \x{...} is allowed.
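The parsing rule quoted above can be checked with a small sketch. It assumes PHP 7.0 or later (for the "\u{...}" string escape, which is used here only to build correctly encoded UTF-8 subjects); the variable names are illustrative.

```php
<?php
// Unbraced form: \xAB consumes exactly two hex digits,
// so the pattern is U+00AB followed by the literal letters "CD".
$unbracedVsThreeChars = preg_match('/^\xABCD$/u', "\u{AB}CD");

// Braced form: \x{ABCD} names the single code point U+ABCD.
$bracedVsOneChar = preg_match('/^\x{ABCD}$/u', "\u{ABCD}");

// The two forms are not interchangeable.
$unbracedVsOneChar = preg_match('/^\xABCD$/u', "\u{ABCD}");
$bracedVsThreeChars = preg_match('/^\x{ABCD}$/u', "\u{AB}CD");

var_dump($unbracedVsThreeChars, $bracedVsOneChar);   // int(1), int(1)
var_dump($unbracedVsOneChar, $bracedVsThreeChars);   // int(0), int(0)
```

The patterns are written in single quotes so that PHP passes \x through to PCRE untouched, while the subjects use double quotes so that PHP itself produces the UTF-8 bytes.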
Some other languages use \u instead of \x for the longer form.
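To illustrate the earlier point about correctly encoded input, here is a minimal sketch comparing the same character (U+00E4) as UTF-8 and as a single raw byte, as a single-byte encoding such as ISO-8859-1 would store it. It assumes PHP 7.0 or later for the "\u{...}" escape; the variable names are illustrative.

```php
<?php
$utf8 = "\u{E4}";   // U+00E4 encoded as UTF-8: the two bytes C3 A4
$raw  = "\xE4";     // the single byte E4, not valid UTF-8 on its own

$utf8Length = strlen($utf8);                        // byte length, not character count

// With /u, \xE4 means the code point U+00E4 and the subject must be valid UTF-8.
$utf8Match = preg_match('/^\xE4$/u', $utf8);

// The raw byte is rejected as malformed UTF-8: preg_match() returns false.
$rawMatch = preg_match('/^\xE4$/u', $raw);
$rawIsBadUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;

// Without /u the pattern works on bytes, so the raw byte matches.
$byteMatch = preg_match('/^\xE4$/', $raw);

var_dump($utf8Length, $utf8Match, $rawMatch, $rawIsBadUtf8, $byteMatch);
// int(2), int(1), bool(false), bool(true), int(1)
```

Note that preg_last_error() reflects only the most recent preg call, so it is read immediately after the failing match.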
1 Note that this matches:
preg_match('/\xC3A4/u', "\u{C3}" . "A4"); // "\u{C3}" (PHP 7+) is U+00C3 as UTF-8; with /u, a raw "\xC3" byte would be rejected as invalid UTF-8