Erlang regex for Chinese characters

TL;DR:

    25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
    ** exception error: bad argument
         in function  re:run/3
            called as re:run([1081,1094,1091,46,97,115,100], "^(.*\\..*)$", [{capture,none}])

How can this be done? "йцу" are characters outside the Latin-1 range; is there any way to tell the re module, or the system as a whole, to treat "strings" as a different encoding?

ORIGINAL QUESTION (for the record):

Another question prompted by the book "Erlang Programming".

Chapter 16 gives an example of reading the tags from MP3 files. It works, great. But there seems to be a bug in the accompanying module, lib_find, which exposes a function for finding files whose paths match a pattern. This call works:

    61> lib_find:files("../..", "*.mp3", true).
    ["../../early/files/Veronique.mp3"]

and this call fails:

    62> lib_find:files("../../..", "*.mp3", true).
    ** exception error: bad argument
         in function  re:run/3
            called as re:run([46,46,47,46,46,47,46,46,47,46,107,101,114,108,47,98,117,
                              105,108,100,115,47,50,48,46,49,47,111|...],
                             "^(.*\\.mp3)$",
                             [{capture,none}])
         in call from lib_find:find_files/6 (lib_find.erl, line 29)
         in call from lib_find:find_files/6 (lib_find.erl, line 39)
         in call from lib_find:files/3 (lib_find.erl, line 17)

Ironically, the investigation traced the culprit to a file in Erlang's own installation:

    .kerl/builds/20.1/otp_src_20.1/lib/ssh/test/ssh_sftp_SUITE_data/sftp_tar_test_data_高兴

OK, so this seems to mean that Erlang uses a stricter default character set, one that does not include hànzì. What are the options? Obviously I could just ignore this and carry on with my exploration, but I feel there is something to learn here. =) For example: where/how can I fix the default encoding? I am a little surprised the default is anything other than UTF-8 - or maybe I'm wrong?

Thanks!

1 answer

TL;DR:

Put the regular expression into Unicode mode by passing the unicode option. (Note that the string "^(.*\\..*)$" is what your call to xmerl_regexp:sh_to_awk("*.*") returns.)

    1> re:run("なにこれ.txt", "^(.*\\..*)$").
    ** exception error: bad argument
         in function  re:run/2
            called as re:run([12394,12395,12371,12428,46,116,120,116],"^(.*\\..*)$")
    2> re:run("なにこれ.txt", "^(.*\\..*)$", [unicode]).
    {match,[{0,16},{0,16}]}

And from your exact example:

    11> re:run("йцу.asd", "^(.*\\..*)$", [unicode, {capture, none}]).
    match

or

    12> {ok, Pattern} = re:compile("^(.*\\..*)$", [unicode]).
    {ok,{re_pattern,1,1,0,
                    <<69,82,67,80,87,0,0,0,16,8,0,0,65,0,0,0,255,255,255,
                      255,255,255,...>>}}
    13> re:run("йцу.asd", Pattern, [{capture, none}]).
    match

The docs for re are quite long and extensive, but that is because regular expressions are a complex subject. You can find the options for compiling regular expressions in the docs for re:compile/2 and the run-time options in the docs for re:run/3.
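One option worth calling out: unicode only switches the pattern and subject to UTF-8; character classes such as \w, \d and \s still match ASCII only. Adding the ucp option makes them use Unicode character properties instead. A small illustrative session (not from the original question; offsets are in bytes of the UTF-8 encoding):

```erlang
%% \w+ with unicode alone does not treat hànzì as word characters.
1> re:run("高兴", "\\w+", [unicode]).
nomatch
%% With ucp, \w uses Unicode properties; each character is 3 UTF-8 bytes.
2> re:run("高兴", "\\w+", [unicode, ucp]).
{match,[{0,6}]}
```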

Discussion

Erlang has decided that strings, while still lists of codepoints, are Unicode everywhere. Since I work in Japan and deal with this all the time, this was a great relief: I get to stop using about half of the conversion libraries I needed in the past (yay!). But it made things a bit trickier for users of the string module, because many operations now run under slightly different assumptions (a string is still considered "flat" even if it is a deep list of grapheme clusters, as long as those clusters occur at the first level of the list).
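To see the grapheme-cluster point concretely, compare length/1, which counts raw list elements (codepoints), with string:length/1, which counts grapheme clusters. This is a small illustrative session, not part of the original answer:

```erlang
%% "é" written in decomposed form: 'e' (101) followed by a
%% combining acute accent (769) - two codepoints, one visible character.
1> Decomposed = [101, 769].
[101,769]
2> length(Decomposed).        %% raw codepoints in the list
2
3> string:length(Decomposed). %% grapheme clusters
1
```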

Unfortunately, encodings are not a simple matter, and UTF-8 is anything but simple once you stray from the most common representations, so there is a lot going on here. I can say with confidence that working with UTF-8 data in binary, string, deep-list and io_data() forms - whether file names, file data, network data or user input from WX or web forms - goes smoothly once you have read the unicode, re and string docs.

That is, of course, a lot of new material to absorb. In 99% of cases everything will work as expected if you decode everything coming in from outside as UTF-8 with unicode:characters_to_list/1 and unicode:characters_to_binary/1, and declare binary string literals with the /utf8 binary type everywhere:

    3> UnicodeBin = <<"この文書はUTF-8です。"/utf8>>.
    <<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
      175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
    4> UnicodeString = unicode:characters_to_list(UnicodeBin).
    [12371,12398,25991,26360,12399,85,84,70,45,56,12391,12377,
     12290]
    5> io:format("~ts~n", [UnicodeString]).
    この文書はUTF-8です。
    ok
    6> re:run(UnicodeString, "UTF-8", [unicode]).
    {match,[{15,5}]}
    7> re:run(UnicodeBin, "UTF-8", [unicode]).
    {match,[{15,5}]}
    8> unicode:characters_to_binary(UnicodeString).
    <<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
      175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
    9> unicode:characters_to_binary(UnicodeBin).
    <<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
      175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>

Source: https://habr.com/ru/post/1272602/

