Regex and index do not match Unicode characters

One of the functions in a library that I am writing returns a string that causes problems when I try to find Unicode characters in it with a regex or with index . The string prints normally (using the Sublime Text console, which prints Unicode):

 <xml>V日한ế</xml> 

And I'm trying to match it like this: $string =~ m/V日한ế/ . I have use utf8; in effect.

I apologize for not being able to provide a minimal working example: when I build the string myself and try to match it, everything works fine. I tried the hexdump function from this site, but it prints the same hexadecimal sequence for the Unicode characters in the string returned by the library and in the string I built ( $string2 = 'V日한ế' ): 56 e6 97 a5 ed 95 9c e1 ba bf . The string from the library has the UTF-8 flag turned off while the one I built has it on, but another test showed me that this is not the problem.

I have only one clue to the source of the problem: the output of use re 'debug'; . It gives the following message:

 Matching REx "V%x{65e5}%x{d55c}%x{1ebf}" against "%n<xml>V%x{e6}%x{97}%x{a5}%x{ed}%x{95}%x{9c}%x{e1}%x{ba}"... 

It prints the character "日" in the regular expression as %x{65e5} and the same character in the problem string as %x{e6}%x{97}%x{a5} . The other Unicode characters are likewise printed differently.

Can anyone with experience debugging strings and encodings tell me why regex and index cannot find the Unicode characters that I know are in my string, and how I can get these functions to find them?

1 answer

Let me make a reproducible test case:

  • Create the input file:

     $ perl -E'say "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>"' >test.xml
     $ cat test.xml
     <xml>V日한ế</xml>

    This writes some bytes to a file. Please note that my terminal emulator uses UTF-8.

  • Try to match the input naively:

     $ cat test.pl
     use strict; use warnings; use utf8; use autodie; use feature 'say';

     open my $fh, "<", shift @ARGV;
     my $s = <$fh>;
     say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
     say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
     $ perl test.pl test.xml
     <xml>V日한ế</xml>
      doesn't match
     string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{e6}\x{97}\x{a5}\x{ed}\x{95}\x{9c}\x{e1}\x{ba}\x{bf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}

    Oh, so the string read from the file is treated as a string of bytes, not as correctly decoded code points. Who would have thought?

  • Add a :utf8 PerlIO layer when opening the file:

     $ cat test-utf8.pl
     use strict; use warnings; use utf8; use autodie; use feature 'say';

     open my $fh, "<:utf8", shift @ARGV;
     my $s = <$fh>;
     say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
     say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
     $ perl test-utf8.pl test.xml
     Wide character in say at test-utf8.pl line 5, <$_[...]> line 1.
     <xml>V日한ế</xml>
      matches
     string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{65e5}\x{d55c}\x{1ebf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}

    Now it matches, because we read the correctly decoded code points from the file.
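
If the string comes from a library call rather than from a file you open yourself, the same decoding can probably be applied to the string directly. Here is a minimal sketch, assuming the library hands back UTF-8 encoded bytes; the variable names are illustrative and not taken from the original code. It uses decode from the core Encode module, and writes the pattern with \x{...} escapes so that the script does not depend on the encoding of its own source file:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: this stands in for the undecoded byte string the library returns.
my $bytes = "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>";

# Same pattern as m/V日한ế/ under "use utf8", spelled as code points.
my $re = qr/V\x{65e5}\x{d55c}\x{1ebf}/;

# Against the raw bytes: the pattern holds code points, the string holds bytes.
my $bytes_length = length $bytes;
my $bytes_match  = $bytes =~ $re ? 1 : 0;

# Turn the UTF-8 bytes into a string of code points, then match again.
my $chars        = decode('UTF-8', $bytes);
my $chars_length = length $chars;
my $chars_match  = $chars =~ $re ? 1 : 0;

printf "bytes: length %d, %s\n", $bytes_length, $bytes_match ? "matches" : "doesn't match";
printf "chars: length %d, %s\n", $chars_length, $chars_match ? "matches" : "doesn't match";
```

The byte string is 21 elements long and does not match; the decoded string is 15 characters long and matches. To print the decoded string itself without the "Wide character" warning, the output handle needs an encoding layer too, for example binmode(STDOUT, ':encoding(UTF-8)').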

Do you get the same result? If not, which combination of perl and OS are you using? (This is perl 5.18.1 on Ubuntu GNU/Linux.)

Another problem with this code: there are several ways to represent ế (as a single precomposed code point, or as a base letter followed by combining marks). Therefore, you should normalize both the string in the regular expression and your input:

 use Unicode::Normalize 'NFC';
 my $regex_body = NFC "V日한ế";
 my $s = NFC scalar <$fh>;
 ... m/\Q$regex_body/ ...
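
To see why normalization matters, here is a small sketch (not part of the original test case) that compares two representations of ế: the precomposed code point U+1EBF, and the letter e followed by a combining circumflex (U+0302) and a combining acute (U+0301). The variable names are made up for illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Two ways to write the same character.
my $precomposed = "\x{1ebf}";           # single code point
my $decomposed  = "e\x{302}\x{301}";    # base letter + two combining marks

# A plain string comparison sees different code point sequences.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# After NFC both collapse to the same single code point.
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

printf "raw: %s, after NFC: %s\n",
    $raw_equal ? "equal" : "different",
    $nfc_equal ? "equal" : "different";
```

This prints "raw: different, after NFC: equal", which is why both the pattern and the input should go through the same normalization form before matching.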

Source: https://habr.com/ru/post/1498118/

