Mathematica regular expressions for Unicode strings

It was an exciting debugging experience. Can you tell the difference between these two lines?

StringReplace["–", RegularExpression@"[\\s\\S]" -> "abc"]
StringReplace["-", RegularExpression@"[\\s\\S]" -> "abc"]

They give very different results when you evaluate them. It turns out that the dash being replaced in the first line is a Unicode en dash, unlike the plain old ASCII hyphen in the second line.
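One way to make the difference visible is to look at the character codes. This is a minimal sketch using the built-in ToCharacterCode; the helper name nonAsciiQ is hypothetical and not part of the original question.

```mathematica
(* The two dashes look alike but have different code points *)
ToCharacterCode["–"]  (* {8211}, i.e. U+2013 EN DASH *)
ToCharacterCode["-"]  (* {45}, the ASCII hyphen-minus *)

(* Hypothetical helper: True if a string contains any character above code 127.
   The -1 guards against the empty string, where ToCharacterCode gives {}. *)
nonAsciiQ[s_String] := Max[ToCharacterCode[s], -1] > 127

nonAsciiQ["–"]  (* True *)
nonAsciiQ["-"]  (* False *)
```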

In the case of the Unicode string, the regular expression does not match. I intended the regular expression "[\\s\\S]" to mean "match any character (including newline)", but Mathematica seems to treat it as "match any ASCII character".

How can I fix the regex so that the first line above evaluates the same as the second? Also, is there an asciify filter that I can apply to strings first?

PS: Mathematica's documentation says that string pattern matching is built on top of the Perl-compatible regular expression library ( http://pcre.org ), so the problem I'm experiencing may not be specific to Mathematica.

3 answers

Here's the asciify function that I first used as a workaround:

 f[s_String] := s
 f[x_] := FromCharacterCode[x]
 asciify[s_String] := StringJoin[f /@ (ToCharacterCode[s] /. x_?(# > 255 &) :> "&" <> ToString[x] <> ";")]
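As a usage sketch, assuming the asciify definitions above have already been evaluated, the function leaves low character codes alone and rewrites anything above 255 as a numeric entity. The input string here is only an illustration.

```mathematica
(* Character codes of the sample input: the en dash is 8211 *)
ToCharacterCode["a–b"]  (* {97, 8211, 98} *)

(* Codes above 255 become "&<code>;", everything else is kept *)
asciify["a–b"]          (* "a&8211;b" *)
asciify["a-b"]          (* "a-b", unchanged because 45 <= 255 *)
```

Note that the threshold in the definition is 255, not 127, so characters in the 128 to 255 range (such as "é") pass through unescaped; the result is not strictly ASCII for such input.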

Then I realized, thanks to @Isaac's answer, that "." as a regex wildcard doesn't seem to have this Unicode problem. I learned from the answers to "Bug in Mathematica: regular expression applied to very long string" that "(.|\n)" is not recommended, but that "(?s)." is. Therefore, I believe that the best solution is the following:

 StringReplace["–", RegularExpression@ "(?s)." -> "abc"] 
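To illustrate what the (?s) flag changes, here is a small sketch on a plain ASCII input that contains a newline. By default "." does not match a newline; with (?s) ("dot matches all") it does. The sample string is an assumption chosen for illustration.

```mathematica
(* Without (?s): "." skips the newline, so it survives the replacement *)
StringReplace["a\nb", RegularExpression["."] -> "x"]      (* "x\nx" *)

(* With (?s): "." also matches the newline, so every character is replaced *)
StringReplace["a\nb", RegularExpression["(?s)."] -> "x"]  (* "xxx" *)
```

This is why "(?s)." can stand in for "[\\s\\S]" as a "match any character, including newline" pattern.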

Instead of RegularExpression, I would use StringExpression. This works as desired:

 f[s_String] := StringReplace[s, _ -> "abc"] 

In StringExpression, Blank[] will match any single character, including non-ASCII characters.
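A short sketch of the string-pattern approach applied directly to the two inputs from the question, without going through RegularExpression. The third input is an added illustration, not from the original post.

```mathematica
(* In a string pattern, _ (Blank[]) stands for any single character *)
StringReplace["–", _ -> "abc"]   (* "abc" for the en dash *)
StringReplace["-", _ -> "abc"]   (* "abc" for the ASCII hyphen *)

(* Each character is replaced separately, including the newline *)
StringReplace["a\n–", _ -> "x"]  (* "xxx" *)
```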

EDIT in response to version updates: as of Mathematica 11.0.1, it looks like for letters with character codes up to 2^16 - 1 (which is cited as the maximum value for FromCharacterCode), the results of StringMatchQ[#, LetterCharacter] now match the results of LetterQ.

 AllTrue[FromCharacterCode /@ Range[2^16 - 1], LetterQ@ # === StringMatchQ[#, LetterCharacter] &] (* True *) 

Using "(.|\n)" as the pattern in RegularExpression seems to work for me. The pattern matches either . (any character except a newline) or \n (a newline).


Source: https://habr.com/ru/post/1305055/