Title-casing strings containing accented characters

Question

I am trying to find a solution for name capitalization in a Perl web app (using Perl v5.10.1). I originally thought of using Lingua::EN::NameCase, but I am seeing some problems with accented characters.

I need to deal with accented characters from different European languages (Irish, French, German).

I saw some evidence on the Internet that Lingua::EN::NameCase should work for my use case. For example, this PerlMonks page: http://www.perlmonks.org/?node_id=889135

Here is my test code based on the link above:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Lingua::EN::NameCase;
    use locale;
    use POSIX qw(locale_h);

    my $locale = 'en_FR.utf8';
    setlocale( LC_CTYPE, $locale );

    binmode DATA,   ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';

    while (my $original_name = <DATA>) {
        chomp $original_name;
        my $normalized_name = nc($original_name);
        printf "%30s L::EN::NC %30s UCFIRST %30s\n", $original_name, $normalized_name, xlc($original_name);
    }

    sub xlc {
        my $str = shift;
        $_ = lc( $str );
        return join q{} => ( map { ucfirst(lc($_)) } ( $str =~ m/(\W+|\w+)/g ) );
    };

    __DATA__
    ÉTIENNE DE LA BOÉTIE
    ÉMILIE DU CHÂTELET
    HÉLÈNE CIXOUS
    Seán Ó Hannracháín
    Máire Ó hÓgartaigh

It produces the output below. Both L::EN::NC and the custom ucfirst(lc()) solution produce incorrect results (note the uppercase letter following each accented character). This seems to be because Perl's regular expression engine matches a word boundary before and after each accented character. I would expect a word boundary to match only between whitespace and non-whitespace.

Can anyone suggest a solution?

Thanks,

Brian.

    ÉTIENNE DE LA BOÉTIE  L::EN::NC  éTienne de la BoéTie  UCFIRST  ÉTienne De La BoÉTie
    ÉMILIE DU CHÂTELET    L::EN::NC  éMilie du ChâTelet    UCFIRST  ÉMilie Du ChÂTelet
    HÉLÈNE CIXOUS         L::EN::NC  HéLèNe Cixous         UCFIRST  HÉLÈNe Cixous
    Seán Ó Hannracháín    L::EN::NC  SeáN ó HannracháíN    UCFIRST  SeÁN ó HannrachÁíN
    Máire Ó hÓgartaigh    L::EN::NC  MáIre ó HóGartaigh    UCFIRST  MÁIre ó HÓGartaigh

regex perl capitalization unicode

Brian foley Oct 16 '13 at 6:43

4 answers

Jjoo · Answer 1 · 2014-02-18T19:41:57+0000

Perl 5.10 is old; you should upgrade if you can.

Below is the version I use for such situations (tested with Perl 5.14.2).

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8::all;

    while (<DATA>) {
        chomp;
        printf "%30s ==> %30s\n", $_, xlc($_);
    }

    sub xlc {
        my $str = shift;
        $str =~ s/(\w+)/ucfirst(lc($1))/ge;
        $str =~ s/( L[ea]s? | Von | D[aeou]s? )\b /lc($1)/xge;
        return $str;
    };

    __DATA__
    ÉTIENNE DE LA BOÉTIE
    ÉMILIE DU CHÂTELET
    HÉLÈNE CIXOUS
    Seán Ó Hannracháín
    Máire Ó hÓgartaigh

Bohdan · Answer 2 · 2013-10-16T14:42:57+0000

If your data is UTF-8 encoded, you should decode it into Perl's internal character representation first:

    utf8::decode($original_name);
    my $normalized_name = nc($original_name);
    printf "%30s L::EN::NC %30s UCFIRST %30s\n", $original_name, $normalized_name, xlc($original_name);
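
To illustrate why decoding matters, here is a minimal sketch using the core Encode module. It assumes the input arrives as raw UTF-8 bytes (written here as hex escapes so the example does not depend on the encoding of the source file). The helper name title_case() is made up for this illustration and is not part of any module mentioned in this thread. On undecoded bytes, \w would match only the ASCII letters, so each accented character would split the word, much like the output in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "ÉMILIE DU CHÂTELET" (assumed input for this sketch).
my $bytes = "\xC3\x89MILIE DU CH\xC3\x82TELET";

# Decode the bytes into a Perl character string before any case mapping.
my $chars = decode('UTF-8', $bytes);

# title_case() is a hypothetical helper: lowercase each word, then
# uppercase its first character, as in the answers in this thread.
sub title_case {
    my ($str) = @_;
    $str =~ s/(\w+)/ucfirst(lc($1))/ge;
    return $str;
}

# Encode on the way out so the terminal receives UTF-8 again.
binmode STDOUT, ':encoding(UTF-8)';
print title_case($chars), "\n";    # prints "Émilie Du Châtelet"
```

Decoding at the input boundary and encoding at the output boundary keeps every step in between working on characters rather than bytes.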

Mark nodine · Answer 3 · 2014-04-15T01:52:44+0000

OK, I just got your script to work. Here is the result I got:

    ÉTIENNE DE LA BOÉTIE  L::EN::NC  Étienne de la Boétie  UCFIRST  Étienne De La Boétie
    ÉMILIE DU CHÂTELET    L::EN::NC  Émilie du Châtelet    UCFIRST  Émilie Du Châtelet
    HÉLÈNE CIXOUS         L::EN::NC  Hélène Cixous         UCFIRST  Hélène Cixous
    Seán Ó Hannracháín    L::EN::NC  Seán Ó Hannracháín    UCFIRST  Seán Ó Hannracháín
    Máire Ó hÓgartaigh    L::EN::NC  Máire Ó Hógartaigh    UCFIRST  Máire Ó Hógartaigh

I had to change two things:

1. I commented out the binmode calls, as they were not needed with whatever encoding Emacs used on my system. Your mileage may vary. If you get this wrong, you will see warnings about characters that do not map to Unicode or about wide characters.
2. I changed the locale. You told it to use the locale for the English language in France. I am not sure that is a valid locale. I chose a locale that actually uses accented characters.

Unfortunately, locale names are not standardized, but the following locale worked for me:

 my $locale = 'fr_FR.utf-8';

In particular, it did not work without the hyphen.
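
Since the accepted spelling differs between systems, it may help to check the return value of setlocale() instead of assuming the call succeeded. The following is only a sketch: try_locales() is a hypothetical helper written for this example, and the candidate spellings are guesses that may or may not exist on a given machine (on many Unix systems, locale -a lists the installed names).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# try_locales() is a hypothetical helper: it returns the first candidate
# name that setlocale() accepts, or undef if none of them works.
sub try_locales {
    my @candidates = @_;
    for my $name (@candidates) {
        # POSIX::setlocale() returns undef when the locale cannot be set.
        my $set = setlocale(LC_CTYPE, $name);
        return $name if defined $set;
    }
    return;
}

# The spellings below are assumptions; adjust them for your system.
my $locale = try_locales('fr_FR.utf-8', 'fr_FR.UTF-8', 'fr_FR.utf8');
if (defined $locale) {
    print "Using locale $locale\n";
}
else {
    warn "No French UTF-8 locale found; case mapping under 'use locale' may be wrong\n";
}
```

Some C libraries accept almost any locale name, so a successful call is a necessary check rather than proof that the intended locale is in effect.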

Pierre · Answer 4 · 2014-06-10T22:44:56+0000

In fact, you just need the utf8 pragma.

    use utf8;
    binmode STDOUT, ':utf8';

    while (my $name = <DATA>) {
        $name =~ s/(\w+)/ucfirst lc $1/eg;
        print $name;
    }

    __DATA__
    ÉTIENNE DE LA BOÉTIE
    ÉMILIE DU CHÂTELET
    HÉLÈNE CIXOUS
    Seán Ó Hannracháín
    Máire Ó hÓgartaigh

I get:

    Étienne De La Boétie
    Émilie Du Châtelet
    Hélène Cixous
    Seán Ó Hannracháín
    Máire Ó Hógartaigh