Unicode string lengths vary

Why do the following strings have different lengths, even though they contain the same number of characters?

 echo strlen("馐 馑 馒 馓 馔 馕 首 馗 馘")."<BR>";
 echo strlen("Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ")."<BR>";

Outputs

 35 26 
+6
5 answers

Each character in the first string takes three bytes because its code point lies around 39,000 in the Unicode character list, while each character in the second string takes only two bytes because its code point is around 400. (The number of bytes/octets required for each character is discussed in the Wikipedia article on UTF-8.)

strlen counts the number of bytes a string occupies, which is what produces such odd results with Unicode text.
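As a rough illustration of the byte widths described above, here is a minimal sketch. The helper `utf8ByteWidth` is hypothetical (it is not a PHP built-in) and simply encodes the standard UTF-8 ranges; the `"\u{...}"` escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of PHP): bytes UTF-8 needs for one code point.
function utf8ByteWidth(int $codePoint): int
{
    if ($codePoint < 0x80) {
        return 1; // ASCII
    }
    if ($codePoint < 0x800) {
        return 2; // includes U+0190..U+0198 (around 400)
    }
    if ($codePoint < 0x10000) {
        return 3; // includes U+9990..U+9998 (around 39,000)
    }
    return 4;
}

// "\u{...}" escapes (PHP 7+) always yield UTF-8 bytes, whatever the file encoding.
$openE = "\u{0190}"; // Ɛ, code point 400
$cjk   = "\u{9990}"; // 馐, code point 39312

echo utf8ByteWidth(0x0190), ' ', strlen($openE), "\n"; // 2 2
echo utf8ByteWidth(0x9990), ' ', strlen($cjk), "\n";   // 3 3

// Nine characters plus eight one-byte spaces, as in the question.
echo 9 * 3 + 8, ' ', 9 * 2 + 8, "\n"; // 35 26
```

The last line reproduces the two totals from the question by hand: the spaces between the characters are plain ASCII and cost one byte each.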

+8

I'm not a PHP expert, but it seems that strlen counts bytes, while mb_strlen counts characters.

EDIT: For more on how multibyte encodings work, see http://en.wikipedia.org/wiki/Variable-width_encoding and, for UTF-8 in particular, http://en.wikipedia.org/wiki/UTF-8

+8

It looks like it is counting the number of bytes in the encoding used. For example, it looks like the second line takes two bytes per non-space character, while the first line takes three bytes per non-space character. I would expect:

 echo strlen("A B C D E F G H I")

to print 17 (nine letters plus eight spaces), one byte per ASCII character.

My guess is that all of this uses UTF-8 encoding, which would certainly fit the byte widths seen here.

+2

Use mb_strlen; it counts characters in the given encoding rather than bytes, as strlen does.
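A minimal sketch of that suggestion, assuming the mbstring extension is loaded and the script file is saved as UTF-8. The variable names are only for illustration, and the encoding is passed explicitly so the result does not depend on the configured internal encoding.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is enabled.
$cjk   = "馐 馑 馒 馓 馔 馕 首 馗 馘";
$latin = "Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ";

// strlen() reports bytes; mb_strlen() reports characters in the named encoding.
echo strlen($cjk), ' ', mb_strlen($cjk, 'UTF-8'), "\n";     // 35 17
echo strlen($latin), ' ', mb_strlen($latin, 'UTF-8'), "\n"; // 26 17
```

Both strings come out as 17 characters (nine symbols plus eight spaces), while the byte counts still differ.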

+1

According to a post on php.net/strlen, strlen treats every string passed to it as if it were ASCII, so each byte counts as one character.

+1

Source: https://habr.com/ru/post/897954/

