One of the functions in a library I am writing returns a string in which I cannot find Unicode characters, either with a regex match or with index . When printed (using the Sublime Text console, which displays Unicode), the string usually looks like this:
<xml>V日한ế</xml>
I am trying to match it like this: $string =~ m/V日한ế/ . The script has use utf8; enabled.
I apologize for not providing a minimal working example: when I build the string myself and try to match it, everything works fine. I tried the hexdump function from this site, but it prints the same byte sequence for the Unicode characters in the string returned by the library and in the string I built myself ( $string2 = 'V日한ế' ): 56 e6 97 a5 ed 95 9c e1 ba bf . The string from the library has the UTF-8 flag turned off while the one I built has it turned on, but another test suggested that this is not the problem.
The only clue I have to the source of the problem is the output of use re 'debug'; . It prints the following message:
Matching REx "V%x{65e5}%x{d55c}%x{1ebf}" against "%n<xml>V%x{e6}%x{97}%x{a5}%x{ed}%x{95}%x{9c}%x{e1}%x{ba}"...
It prints the character "日" in the regular expression as the single code point %x{65e5} , but the same character in the problem string as the separate bytes %x{e6}%x{97}%x{a5} . The other Unicode characters are likewise printed differently.
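For illustration, one situation that would produce exactly this kind of debug output is a target string that holds undecoded UTF-8 bytes while the pattern holds decoded characters. The sketch below rests on that assumption, which I have not confirmed for my library: encode() is only a stand-in that simulates the library's return value, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use utf8;                          # literals in this file are decoded characters
use Encode qw(decode encode);

my $needle = 'V日한ế';              # 4 characters, as written in the source

# Assumption: the library returns raw UTF-8 bytes, not characters.
# encode() simulates that here; it is not the real library call.
my $from_library = encode('UTF-8', "<xml>$needle</xml>");

# Byte string vs. character pattern: 0xE6 0x97 0xA5 is compared with U+65E5.
my $raw_match = $from_library =~ /V日한ế/ ? 1 : 0;

# Decoding turns the bytes back into characters before matching.
my $decoded       = decode('UTF-8', $from_library);
my $decoded_match = $decoded =~ /V日한ế/ ? 1 : 0;
my $decoded_index = index($decoded, $needle);

printf "raw bytes: length=%d match=%d\n",
    length($from_library), $raw_match;
printf "decoded:   length=%d match=%d index=%d\n",
    length($decoded), $decoded_match, $decoded_index;
```

If the library string behaves like $from_library here (length counts bytes rather than characters, and the match fails until the string is decoded), that would point to a missing decoding step rather than a problem in regex or index themselves.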
Can anyone with experience debugging strings and encodings tell me why regex matching and index cannot find Unicode characters that I know are in my string, and how I can get these functions to find them?