TL;DR:
UTF-8 handling works once you put the regular expression into Unicode mode with the unicode option. (Note that the string "^(.*\\..*)$" is what your call to xmerl_regexp:sh_to_awk/1 produced.)
    1> re:run("なにこれ.txt", "^(.*\\..*)$").
    ** exception error: bad argument
         in function  re:run/2
            called as re:run([12394,12395,12371,12428,46,116,120,116],"^(.*\\..*)$")
    2> re:run("なにこれ.txt", "^(.*\\..*)$", [unicode]).
    {match,[{0,16},{0,16}]}
And from your exact example:
    11> re:run(".asd", "^(.*\\..*)$", [unicode, {capture, none}]).
    match
or
    12> {ok, Pattern} = re:compile("^(.*\\..*)$", [unicode]).
    {ok,{re_pattern,1,1,0,
                    <<69,82,67,80,87,0,0,0,16,8,0,0,65,0,0,0,255,255,255,
                      255,255,255,...>>}}
    13> re:run(".asd", Pattern, [{capture, none}]).
    match
The docs for re are quite long and extensive, but that is because regular expressions are a complex subject. You can find the compile-time options in the documentation for re:compile/2 and the run-time options in the documentation for re:run/3.
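As a minimal sketch of that split (fresh shell, made-up pattern and file name chosen only for illustration): unicode and caseless are compile options you pass to re:compile/2, while {capture, none} is a run option you pass to re:run/3.

    %% compile options go to re:compile/2, run options to re:run/3
    %% (output of the compiled pattern abbreviated)
    1> {ok, MP} = re:compile("\\.txt$", [unicode, caseless]).
    {ok,{re_pattern,0,1,0,<<...>>}}
    2> re:run("なにこれ.TXT", MP, [{capture, none}]).
    match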
Discussion
Erlang decided that strings, while still lists of code points, are assumed to be UTF-8 everywhere. Since I work in Japan and deal with this all the time, it was a great relief for me because I get to stop using about half of the conversion libraries I needed in the past (yay!), but it made things a bit more complicated for users of the string module, because many operations are now performed under slightly different assumptions (a string is still considered “flat” even when it is a deep list of grapheme clusters, as long as those clusters sit at the first level of the list).
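As a minimal sketch of what that means in practice (OTP 20 or later, fresh shell): [$e, 778] is the letter e followed by U+030A, the combining ring above, and the string module counts it as a single grapheme cluster even though the raw list has two elements.

    %% grapheme-cluster-aware length vs. raw list length
    1> string:length([$e, 778]).
    1
    2> length([$e, 778]).
    2
    3> string:length("なにこれ.txt").
    8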
Unfortunately, encoding is not a simple matter, and UTF-8 is anything but simple once you step outside the most common representations, so a lot of this comes up. I can tell you with confidence that working with UTF-8 data in binary, string, deep-list and iodata() forms, whether it is file names, file data, network data or user input from WX or web forms, works fine once you have read the unicode, re and string module documentation.
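For instance, re:run/3 accepts a deep, mixed chardata() subject as long as the pattern runs in Unicode mode; here is a small sketch with a made-up file name, where the list-plus-binary nesting is only there to show the deep form:

    %% deep/mixed subject: a charlist element followed by a utf8 binary
    1> re:run(["なにこれ", <<".txt"/utf8>>], "^(.*\\..*)$", [unicode, {capture, none}]).
    match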
But that is, of course, a lot of new material to take in. In 99% of cases everything will work as expected if you decode everything that comes in from the outside as UTF-8 using unicode:characters_to_list/1 or unicode:characters_to_binary/1, and tag binary string literals as utf8 binaries everywhere:
    3> UnicodeBin = <<"この文書はUTF-8です。"/utf8>>.
    <<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
      175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
    4> UnicodeString = unicode:characters_to_list(UnicodeBin).
    [12371,12398,25991,26360,12399,85,84,70,45,56,12391,12377,
     12290]
    5> io:format("~ts~n", [UnicodeString]).
    この文書はUTF-8です。
    ok
    6> re:run(UnicodeString, "UTF-8", [unicode]).
    {match,[{15,5}]}
    7> re:run(UnicodeBin, "UTF-8", [unicode]).
    {match,[{15,5}]}
    8> unicode:characters_to_binary(UnicodeString).
    <<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
      175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
    9> unicode:characters_to_binary(UnicodeBin).
    <<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
      175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>