Perl Regular Expression Matching on Unicode Large Code Points

I am trying to replace various characters with either a single quote or a double quote.

Here is my test file:

 # Replace all with double quotes
 ＂ fullwidth
 “ left
 ” right
 „ low
 " normal
 # Replace all with single quotes
 ' normal
 ‘ left
 ’ right
 ‚ low
 ‛ reverse
 ` backtick

I'm trying to do it like this:

 perl -Mutf8 -pi -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/ug" test.txt
 perl -Mutf8 -pi -e 's/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/ug' text.txt

But only the backtick is replaced properly. I think this is because the other code points are too large, but I cannot find any documentation on this.

Here is a one-liner that prints the Unicode code points in the file, to confirm that they match the ones in my regular expression.

 $ awk -F\  '{print $1}' test.txt | \
     perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord)." ".$_."\n"}'
 U+FF02 ＂
 U+201C “
 U+201D ”
 U+201E „
 U+0022 "
 U+0027 '
 U+2018 ‘
 U+2019 ’
 U+201A ‚
 U+201B ‛
 U+0060 `

Why doesn't my regex match?

2 answers

This does not match because you forgot -CSAD in your Perl invocation and did not set $PERL_UNICODE in your environment. You only passed -Mutf8, which declares that your source code is in that encoding. It does not affect I/O.

You need:

 $ perl -CSAD -pi.orig -e "s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/'/g" test.txt 

I mention this thing in this answer a couple of times.


With use utf8; you told Perl that your source code is UTF-8. This is useless (albeit harmless), since you have limited your source code to ASCII.

With /u you told Perl to use the Unicode definitions of \s, \d, and \w. This is useless (albeit harmless), since you are not using any of these escapes.

You have not decoded your input, so it consists solely of bytes, and most of the characters in your class (for example, \x{2018}) cannot match anything. You need to decode your input (and, of course, encode your output). Using -CSD will most likely do this.

 perl -CSD -i -pe'
    s/[\x{2018}\x{201A}\x{201B}\x{FF07}\x{2019}\x{60}]/\x27/g;
    s/[\x{FF02}\x{201C}\x{201D}\x{201E}]/"/g;
 ' text.txt

Source: https://habr.com/ru/post/1437288/

