How to remove diacritics in Perl 6

Two related questions. Perl 6 is smart enough to treat a grapheme as a single character, whether it is one Unicode code point (for example, ä, U+00E4) or two or more combined code points (for example, p̄ and ḏ̣). This little piece of code

 my @symb;
 @symb.push("ä");
 @symb.push("p" ~ 0x304.chr); # "p̄"
 @symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
 say "$_ has {$_.chars} character" for @symb;

gives the following result:

 ä has 1 character
 p̄ has 1 character
 ḏ̣ has 1 character

But sometimes I would like to be able to do the following.

1) Remove diacritics from ä. So I need some kind of method like

 "ä".mymethod → "a" 

2) Split a "combined" character into its parts, i.e. split p̄ into p and COMBINING MACRON (U+0304). For instance, something like the following in bash:

 $ echo p̄ | grep . -o | wc -l
 2
+5
3 answers

Perl 6 has excellent support for Unicode processing in the Str class. To accomplish what you specify in (1), you can use the samemark method.

From the documentation:

 multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
 method samemark(Str:D: Str:D $pattern --> Str:D)

Returns a copy of $string with the mark/accent information for each character changed so that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty, no changes are made.

Examples:

 say 'åäö'.samemark('aäo'); # OUTPUT: «aäo␤»
 say 'åäö'.samemark('a'); # OUTPUT: «aao␤»
 say samemark('Pêrl', 'a'); # OUTPUT: «Perl␤»
 say samemark('aöä', ''); # OUTPUT: «aöä␤»

This can be used both to remove marks / diacritics from letters, and to add them.

For (2), there are several ways to do it (TIMTOWTDI). If you need a list of all the code points in a string, you can use the ords method, which returns them as a List (technically a Positional).

 say "p̄".ords; # OUTPUT: «(112 772)␤» 

You can use the uniname method to get the Unicode name for the code point:

 .uniname.say for "p̄".ords; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤» 

or just use the uninames method / routine:

 .say for "p̄".uninames; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤» 

If you just need the number of code points in the string, you can use codes:

 say "p̄".codes; # OUTPUT: «2␤» 

This is different from chars, which counts the number of characters (graphemes) in the string:

 say "p̄".chars; # OUTPUT: «1␤» 
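One caveat, as far as I understand the behavior: ords and codes report the string in its composed form, so a character that has a precomposed code point, such as ä (U+00E4), is not split the way p̄ is. A minimal sketch of the difference (the variable name is only illustrative, and the U+XXXX formatting is my own choice for readability):

```raku
my $a-umlaut = "\x[E4]";   # ä as a single precomposed code point

# .ords sees the composed form, so there is nothing to split here.
say $a-umlaut.ords.map(*.fmt('U+%04X')).join(' ');   # U+00E4

# .NFD decomposes first, exposing the base letter and the combining mark.
say $a-umlaut.NFD.map(*.fmt('U+%04X')).join(' ');    # U+0061 U+0308
```

So if you want the base letter and the mark separately for every character, decomposing with NFD first is probably the more reliable route.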

Also see @hobbs's answer using NFD.

+3

This is the best I could come up with from the docs. There may be a simpler way, but I'm not sure.

 my $in = "Él está un pingüino";
 my $stripped = Uni.new($in.NFD.grep: { !uniprop($_, 'Grapheme_Extend') }).Str;
 say $stripped; # El esta un pinguino

The .NFD method converts the string to normalization form D (decomposed), which separates graphemes into base code points and combining code points wherever possible. The grep then returns a list of only those code points that do not have the "Grapheme_Extend" property, i.e. it removes the combining code points. Uni.new(...).Str then assembles those code points back into a string.
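If you need this in more than one place, the same three steps can be wrapped in a small helper. This is only a sketch: the name strip-marks is made up for illustration and is not a built-in, and it assumes that dropping every Grapheme_Extend code point is what you want.

```raku
# Hypothetical helper (not a built-in): remove combining marks from a string.
sub strip-marks(Str:D $text --> Str:D) {
    # Decompose, then keep only code points that are not combining marks.
    my @kept = $text.NFD.grep: { !uniprop($_, 'Grapheme_Extend') };
    # Reassemble the remaining code points into a string.
    Uni.new(@kept).Str
}

say strip-marks("Él está un pingüino");   # El esta un pinguino
say strip-marks("åäö");                   # aao
```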

You can also assemble these pieces to answer your second question; e.g.:

 $in.NFD.map: { Uni.new($_).Str } 

will return a list of 1-character strings, each containing a single decomposed code point, or

 $in.NFD.map(&uniname).join("\n") 

makes a nice little Unicode debugger.

+3

I can't say that this is better or faster, but I strip diacritics like this:

 my $s = "åäö";
 say $s.comb.map({.NFD[0].chr}).join; # output: "aao"
+1

Source: https://habr.com/ru/post/1272706/

