Text::SpellChecker module and Unicode

Question

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Text::SpellChecker;

    my $text = "coördinator";
    my $checker = Text::SpellChecker->new( text => $text );

    while ( my $word = $checker->next_word ) {
        print "Bad word is $word\n";
    }

Output: Bad word is rdinator

Desired: Bad word is coördinator

The module breaks if I have Unicode in $text. Any idea how this can be resolved?

I have installed Aspell 0.50.5, which is used by this module. I think this may be the culprit.

Edit: Text::SpellChecker requires either Text::Aspell or Text::Hunspell. I uninstalled Text::Aspell, installed Hunspell and Text::Hunspell, and then ran:

    $ hunspell -d en_US -l < badword.txt
    coördinator

This shows the correct result, which means that something is wrong with either my code or Text::SpellChecker.

Following Miller's suggestion, I tried the following:

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Text::SpellChecker;
    use utf8;

    binmode STDOUT, ":encoding(utf8)";

    my $text = "coördinator";
    my $flag = utf8::is_utf8($text);
    print "Flag is $flag\n";
    print "Text is $text\n";

    my $checker = Text::SpellChecker->new(text => $text);
    while (my $word = $checker->next_word) {
        print "Bad word is $word\n";
    }

OUTPUT:

    Flag is 1
    Text is coördinator
    Bad word is rdinator

Does this mean that the module is not able to process UTF-8 characters correctly?


perl unicode utf-8 perl-module

Chankey pathak Nov 03 '14 at 4:46


2 answers

AnFi · Answer 1 · 2014-11-03T09:04:36+0000

This is a bug in Text::SpellChecker: the current version assumes that words consist only of ASCII letters.

http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm

    #
    # next_word
    #
    # Get the next misspelled word.
    # Returns false if there are no more.
    #
    sub next_word {
        ...
        while ($self->{text} =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g) {

IMHO the best fix would be to use a word-splitting regular expression per language/dictionary, or to delegate word splitting to the underlying library used. aspell list reports coördinator as a single word.
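To illustrate why the ASCII-only pattern quoted above produces "rdinator", here is a minimal sketch that runs that pattern next to a Unicode-aware alternative. The \p{L} pattern is only an assumption about what a fix could look like, not necessarily what any released version of the module uses, and it expects the text to be a decoded character string with a precomposed ö (U+00F6).

```perl
#!/usr/bin/perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# "coördinator" written with an escape so the example does not depend
# on the encoding of this source file (U+00F6 is the precomposed ö).
my $text = "co\x{F6}rdinator";

# Pattern from Text::SpellChecker 0.11: ASCII letters only.
# The ö is not in [a-zA-Z], so it acts as a word separator.
my @ascii = $text =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g;

# Hypothetical replacement: \p{L} matches any Unicode letter.
my @unicode = $text =~ m/(\p{L}+(?:'\p{L}+)?)/g;

print "ASCII pattern:   @ascii\n";
print "Unicode pattern: @unicode\n";
```

With the ASCII pattern the text splits into "co" and "rdinator", and only the second piece is reported as misspelled. With \p{L} the word stays in one piece, so the spell checker backend sees the whole of coördinator.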

Brian · Answer 2 · 2014-11-04T03:02:35+0000

I incorporated Chankey's solution and released version 0.12 to CPAN; give it a try.

The validity of the diaeresis in words like coördinator is interesting. By default, the aspell and hunspell dictionaries mark it as incorrect, although some publications may disagree.

best, Brian
