I am trying to find a solution for name capitalization in a Perl web app (using Perl v5.10.1). I originally planned to use Lingua::EN::NameCase, but I am seeing problems with accented characters.
I need to deal with accented characters from different European languages (Irish, French, German).
I found some evidence online that Lingua::EN::NameCase should work for my use case. For example, this thread on PerlMonks: http://www.perlmonks.org/?node_id=889135
Here is my test code based on the link above:
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::NameCase;
use locale;
use POSIX qw(locale_h);

my $locale = 'en_FR.utf8';
setlocale( LC_CTYPE, $locale );

binmode DATA, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

while (my $original_name = <DATA>) {
    chomp $original_name;
    my $normalized_name = nc($original_name);
    printf "%30s L::EN::NC %30s UCFIRST %30s\n", $original_name, $normalized_name, xlc($original_name);
}

sub xlc {
    my $str = shift;
    $_ = lc( $str );
    return join q{} => ( map { ucfirst(lc($_)) } ( $str =~ m/(\W+|\w+)/g ) );
};

__DATA__
ÉTIENNE DE LA BOÉTIE
ÉMILIE DU CHÂTELET
HÉLÈNE CIXOUS
Seán Ó Hannracháín
Máire Ó hÓgartaigh
It produces the output below. Both L::EN::NC and the custom ucfirst(lc()) solution give incorrect results (note the stray capital letter after each accented character). This seems to happen because Perl's regular expression engine sees a "word boundary" before and after each accented character. I would expect a word boundary to occur only at transitions between whitespace and non-whitespace characters.
Can anyone suggest a solution?
Thanks,
Brian.
  ÉTIENNE DE LA BOÉTIE L::EN::NC   éTienne de la BoéTie UCFIRST   ÉTienne De La BoÉTie
    ÉMILIE DU CHÂTELET L::EN::NC     éMilie du ChâTelet UCFIRST     ÉMilie Du ChÂTelet
         HÉLÈNE CIXOUS L::EN::NC          HéLèNe Cixous UCFIRST          HÉLÈNe Cixous
    Seán Ó Hannracháín L::EN::NC     SeáN ó HannracháíN UCFIRST     SeÁN ó HannrachÁíN
    Máire Ó hÓgartaigh L::EN::NC     MáIre ó HóGartaigh UCFIRST     MÁIre ó HÓGartaigh
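One thing I have not ruled out is the locale setup. As far as I understand, under `use locale` the `\w` class and `lc`/`ucfirst` follow the current LC_CTYPE locale, and if `setlocale` fails (I am not sure `en_FR.utf8` exists on my system) the process stays in the "C" locale, where accented letters are not word characters. Below is a minimal sketch to test that theory: it drops `use locale` entirely and works on decoded character strings, so Perl's Unicode rules should apply. The helper name `title_case` is my own, not part of any module, and the sketch assumes the script file itself is saved as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not from Lingua::EN::NameCase): lowercase each run of
# word characters and uppercase its first letter. No "use locale" is in
# effect, so \w and the case mappings use Unicode rules on these strings.
sub title_case {
    my ($str) = @_;
    $str =~ s/(\w+)/\u\L$1/g;
    return $str;
}

for my $name ( 'ÉTIENNE DE LA BOÉTIE', 'ÉMILIE DU CHÂTELET', 'HÉLÈNE CIXOUS' ) {
    print title_case($name), "\n";
}
```

If the theory is right, this should treat "ÉTIENNE" as a single word. It is only a diagnostic, not a replacement for NameCase: it capitalizes every word, so particles like "de la" become "De La" and "Ó hÓgartaigh" would lose its lowercase "h".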