Unicode string lengths vary

Question

Why are the lengths of the following strings different, even though each string contains the same number of characters?

echo strlen("馐 馑 馒 馓 馔 馕 首 馗 馘")."<BR>";
echo strlen("Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ")."<BR>";

Outputs

 35 26

+6

php unicode

Imran mar bukhsh 24 sept '11 at 6:58

5 answers

I'm not a PHP expert, but it seems that strlen counts bytes, whereas mb_strlen counts characters.

EDIT - For more on how multibyte encodings work, see http://en.wikipedia.org/wiki/Variable-width_encoding and, for UTF-8 in particular, see http://en.wikipedia.org/wiki/UTF-8

+8

Yahia 24 sept '11 at 7:02

It looks like it is counting the number of bytes in the encoding used. For example, the second line appears to take two bytes per non-space character, while the first line takes three bytes per non-space character. I would expect:

 echo strlen("A B C D E F G H I")

to print 17 - one byte per ASCII character, spaces included.

My guess is that all of this is using the UTF-8 encoding, which would certainly be consistent with the varying byte widths.

+2

Jon Skeet 24 sept '11 at 7:02

Use mb_strlen. It counts the characters in the given encoding, not the bytes as strlen does.
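As a rough illustration of the difference, here is a minimal sketch using the two strings from the question. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext-mbstring is available.
$cjk   = "馐 馑 馒 馓 馔 馕 首 馗 馘";
$latin = "Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ";

// strlen() returns the number of bytes in the string.
$cjkBytes   = strlen($cjk);
$latinBytes = strlen($latin);

// mb_strlen() returns the number of characters in the given encoding.
$cjkChars   = mb_strlen($cjk, 'UTF-8');
$latinChars = mb_strlen($latin, 'UTF-8');

echo "$cjkBytes bytes, $cjkChars characters\n";     // 35 bytes, 17 characters
echo "$latinBytes bytes, $latinChars characters\n"; // 26 bytes, 17 characters
```

Passing the encoding explicitly avoids depending on the configured internal encoding.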

+1

Mirrorce soaica 24 sept '11 at 7:02

According to this post on php.net/strlen , PHP interprets all strings passed to strlen as ASCII.

+1

Rusty fausak 24 sept '11 at 7:02

Niet the dark absol · Accepted Answer · 2011-09-24T07:01:57+0000

The first batch of characters takes up three bytes each because they sit way down in the 39,000s of the character list, while the second group takes only two bytes each, being down at about 400. (The number of bytes/octets required for each character is discussed in the UTF-8 article on Wikipedia.)

strlen counts the number of bytes taken up by the string, which is what produces such odd results with Unicode.
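The relationship between a character's position in the list and its byte count can be sketched as follows. utf8_width is a hypothetical helper written for this illustration, not a PHP built-in; the thresholds are the standard UTF-8 boundaries. The sketch assumes a UTF-8 source file and PHP 7.2 or later for mb_ord.

```php
<?php
// Hypothetical helper: number of bytes UTF-8 needs for a given code point.
function utf8_width(int $codePoint): int
{
    if ($codePoint < 0x80) {
        return 1; // ASCII range
    }
    if ($codePoint < 0x800) {
        return 2; // up to U+07FF
    }
    if ($codePoint < 0x10000) {
        return 3; // up to U+FFFF
    }
    return 4;     // supplementary planes
}

// One character from each group in the question, plus an ASCII letter.
foreach (['A', 'Ɛ', '馐'] as $char) {
    $cp = mb_ord($char, 'UTF-8'); // code point as an integer
    printf(
        "U+%04X (%d): %d byte(s) predicted, strlen gives %d\n",
        $cp,
        $cp,
        utf8_width($cp),
        strlen($char)
    );
}
```

Here Ɛ is code point 400 and 馐 is code point 39312, which matches the "about 400" and "39,000s" figures above.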
