Perl string length independent of character encoding

Question

The length function seems to count each Chinese character as more than one character. How can I determine the length of a string in Perl regardless of character encoding (treating each Chinese character as one character)?

string perl unicode character-encoding

syker Mar 03 '11 at 5:56

1 answer

mu is too short · Answer 1 · 2011-03-03T06:20:19+0000

The length function works with characters, not octets (AKA bytes). The definition of a character depends on the encoding. Chinese characters are still single characters (if the encoding is set correctly!), but they occupy more than one octet of space. Thus, the length of a string in Perl depends on the character encoding that Perl thinks the string is in; the only string length that is independent of character encoding is the plain byte length.

Make sure your string is marked as UTF-8 and is actually encoded in UTF-8. For example, this gives 3:

 $ perl -e 'print length("长")'

whereas it gives 1:

 $ perl -e 'use utf8; print length("长")'

as does this:

 $ perl -e 'use Encode; print length(Encode::decode("utf-8", "长"))'
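The same distinction can be shown in a small script. This is only a minimal sketch: the sample string and the variable names are illustrative, and the characters are written as \x{...} escapes so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two Chinese characters (U+957F and U+57CE) as a character string.
# Escapes are used so the script's own file encoding does not matter.
my $str = "\x{957F}\x{57CE}";

# length() on a character string counts characters.
my $chars = length($str);

# Encoding to UTF-8 yields a byte string, so length() now counts octets.
my $bytes = length(encode('UTF-8', $str));

print "characters: $chars, bytes: $bytes\n";   # characters: 2, bytes: 6
```

Each of these two characters takes three octets in UTF-8, which is why the byte count is three times the character count.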

If you are getting your Chinese characters from a file, make sure you call binmode $fh, ':utf8' on the filehandle before reading or writing; if you are getting your data from a database, make sure the database returns strings in UTF-8 (or use Encode to decode them yourself).
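As a rough illustration of the file case, the sketch below writes a short UTF-8 file and reads it back twice, once as raw octets and once through a decoding layer. The temporary file and variable names are made up for the example, and it uses the ':encoding(UTF-8)' layer, a stricter variant of the ':utf8' layer mentioned above, rather than ':utf8' itself.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway UTF-8 file to read back (purely illustrative).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} "\x{957F}\x{57CE}\n";            # U+957F U+57CE, two characters
close $out or die "close failed: $!";

# Read with no decoding: Perl sees the raw octets.
open my $raw, '<:raw', $path or die "open failed: $!";
my $raw_line = <$raw>;
close $raw;
chomp $raw_line;

# Read through a decoding layer: Perl sees characters.
open my $fh, '<:encoding(UTF-8)', $path or die "open failed: $!";
my $line = <$fh>;
close $fh;
chomp $line;

my $raw_len  = length $raw_line;   # octets
my $char_len = length $line;       # characters
print "raw: $raw_len, decoded: $char_len\n";  # raw: 6, decoded: 2
```

The file contents are identical in both reads; only the layer on the filehandle changes what length reports.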

I don't think you need to have everything in UTF-8; you really only need to make sure that the string is marked as having the correct encoding. I would still go with UTF-8 from front to back (and even sideways), though: it is the lingua franca for Unicode, and things will be easier if you use it everywhere.

You might want to spend some time reading the perlunicode man page if you intend to deal with non-ASCII data.
