Java regex \cx (control character escapes)

The Javadoc for java.util.regex.Pattern says \cx represents the control character corresponding to x. So I thought Pattern.compile() would reject \c followed by any character other than [@-_], but it does not!

As @tchrist commented on one of the answers to "What is a regex for control characters?", the range is not checked at all. I tested a couple of characters from the higher blocks as well as the astral planes, and it looks like it just flips the 7th LSB (the 0x40 bit) of the code point value.

So is it a Javadoc error or an implementation error, or am I misunderstanding something? Is the \cx syntax a Java invention, or is it supported by other regular expression engines, especially Perl? How is it handled there?
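A minimal Java sketch for checking the behavior described in the question. The class and method names (ControlEscapeCheck, flipped, matches) are hypothetical and the sample characters are arbitrary. It builds \c followed by one character and reports whether the compiled pattern matches the code point with the 0x40 bit flipped. Results may differ between JDK versions.

```java
import java.util.regex.Pattern;

class ControlEscapeCheck {
    // Hypothetical helper: the value the question observes for \cx,
    // i.e. the code point of x with the 0x40 bit flipped.
    static int flipped(int codePoint) {
        return codePoint ^ 0x40;
    }

    // True if the regex \c<x> compiles and matches exactly the single
    // code point `expected`.
    static boolean matches(int x, int expected) {
        String regex = "\\c" + new String(Character.toChars(x));
        String input = new String(Character.toChars(expected));
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Inside and outside [@-_], plus a Latin-1 and an astral code point.
        int[] samples = { 'A', '@', '_', 'a', '1', 0x00E9, 0x1F600 };
        for (int x : samples) {
            int y = flipped(x);
            System.out.printf("\\c + U+%04X -> U+%04X, matches: %b%n",
                    x, y, matches(x, y));
        }
    }
}
```

If every line reports true, no range check is applied to the character after \c, as the question suggests.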

1 answer

All versions of Perl behave the same for the following escapes:

  • When \c is followed by an ASCII uppercase letter or one of @[\]^_?,

    chr(ord($char) ^ 0x40)

    This provides full coverage of all ASCII control characters (0x00..0x1F, 0x7F).

    \c@ === \x00
    \cA === \x01
    ...
    \cZ === \x1A
    \c[ === \x1B
    \c\ === \x1C   # Sometimes \c\\ is needed.
    \c] === \x1D
    \c^ === \x1E
    \c_ === \x1F
    \c? === \x7F
    
  • When \c is followed by a lowercase ASCII letter,

    chr(ord($char) ^ 0x60)

    This makes the escape case-insensitive.

    \ca === \cA === \x01
    ...
    \cz === \cZ === \x1A
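The two version-independent rules above can be sketched in Java as follows. This is only a model of the arithmetic quoted in the answer, not Perl's implementation; the class and method names are hypothetical, and any character outside the two rules is rejected here simply because those cases are version-specific.

```java
class PerlControlEscape {
    // Hypothetical helper modelling the two rules listed above:
    //   uppercase letter or one of @[\]^_?  ->  ord ^ 0x40
    //   lowercase letter                    ->  ord ^ 0x60
    static int controlChar(char c) {
        if ((c >= 'A' && c <= 'Z') || "@[\\]^_?".indexOf(c) >= 0) {
            return c ^ 0x40;
        }
        if (c >= 'a' && c <= 'z') {
            return c ^ 0x60;
        }
        // Everything else depends on the Perl version, so this sketch
        // does not model it.
        throw new IllegalArgumentException(
                "not covered by the version-independent rules: " + c);
    }

    public static void main(String[] args) {
        for (char c : new char[] { '@', 'A', 'a', 'Z', 'z', '[', '\\', '_', '?' }) {
            System.out.printf("\\c%c === \\x%02X%n", c, controlChar(c));
        }
    }
}
```

XOR with 0x60 clears both the 0x40 and 0x20 bits of a lowercase letter, which is why \ca and \cA map to the same control character.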
    

For all other characters, the behavior changed in Perl 5.20.

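  • In Perl ≥ 5.20,

    • When \c is followed by a space, an ASCII digit, or one of !"#$%&'()*+,-./:;<=>{|}~,

      chr(ord($char) ^ 0x40), with a warning (the escape "is more clearly written simply as" the resulting character).

    • When \c is followed by an ASCII control character (0x00..0x1F, 0x7F) or a non-ASCII character (≥ 0x80),

      it is an error: Character following "\c" must be printable ASCII.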

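  • In Perl < 5.20,

    • When \c is followed by a space, an ASCII digit, one of !"#$%&'()*+,-./:;<=>{|}~, or an ASCII control character (0x00..0x1F, 0x7F),

      chr(ord($char) ^ 0x40)

    • When \c is followed by a character ≥ 0x100,

      chr(ord(substr(encode_utf8($char), 0, 1)) ^ 0x40) . substr(encode_utf8($char), 1)

    • When \c is followed by a character in 0x80..0xFF,

      chr(ord($char) ^ 0x40) or, in some cases, the same result as for characters ≥ 0x100.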
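The byte-level expression given for characters ≥ 0x100 in Perl < 5.20 is easier to follow with a concrete value. The Java sketch below only mirrors that arithmetic (XOR the first UTF-8 byte with 0x40, keep the remaining bytes); the class and method names are hypothetical, and it returns raw byte values, whereas the Perl expression yields a string with one character per byte.

```java
import java.nio.charset.StandardCharsets;

class LegacyWideControlEscape {
    // Hypothetical helper: encode the code point as UTF-8, flip the 0x40
    // bit of the first byte, and leave the remaining bytes unchanged.
    static byte[] legacyBytes(int codePoint) {
        byte[] utf8 = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        utf8[0] ^= 0x40;
        return utf8;
    }

    public static void main(String[] args) {
        // U+0100 is C4 80 in UTF-8, so the first byte becomes C4 ^ 40 = 84.
        StringBuilder hex = new StringBuilder();
        for (byte b : legacyBytes(0x0100)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

The result is generally no longer valid UTF-8, which illustrates why the old behavior for wide characters was not useful in practice.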


Source: https://habr.com/ru/post/1627334/

