PHP - length of string containing emojis / special characters

I am creating an API for a mobile application and I have a problem counting the length of a string that contains emoji. My code is:

    $str = "👍🏿✌🏿️ @mention";
    printf("strlen: %d" . PHP_EOL, strlen($str));
    printf("mb_strlen UTF-8: %d" . PHP_EOL, mb_strlen($str, "UTF-8"));
    printf("mb_strlen UTF-16: %d" . PHP_EOL, mb_strlen($str, "UTF-16"));
    printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("UTF-8", "UTF-16", $str)));
    printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("ISO-8859-1", "UTF-16", $str)));

Output:

    strlen: 27
    mb_strlen UTF-8: 14
    mb_strlen UTF-16: 13
    iconv UTF-16: 14
    iconv UTF-16: 27

However, I need to get 17 as the result. We checked the string length on iOS, Android and Windows Phone, and it is 17 everywhere. iOS (Swift) snippet:

    var str = "👍🏿✌🏿️ @mention"
    (str as NSString).length // 17
    count(str)               // 13
    count(str.utf16)         // 17
    count(str.utf8)          // 27

We need to use NSString because of the library. I need this length to get the start and end position of "@mention". If a string contains only text or only emoji, it works fine, so the problem may be with mixed content.

What am I doing wrong? What other information can I provide to help point me in the right direction?

Thanks!

2 answers

Your functions count different things.

    Grapheme   Code points      UTF-16 code units   UTF-16LE bytes      UTF-8 bytes
    --------   --------------   -----------------   -----------------   --------------------
    👍         U+1F44D          D83D DC4D           3D D8 4D DC         F0 9F 91 8D
    🏿         U+1F3FF          D83C DFFF           3C D8 FF DF         F0 9F 8F BF
    ✌          U+270C           270C                0C 27               E2 9C 8C
    🏿️        U+1F3FF U+FE0F   D83C DFFF FE0F      3C D8 FF DF 0F FE   F0 9F 8F BF EF B8 8F
    (space)    U+0020           0020                20 00               20
    @          U+0040           0040                40 00               40
    m          U+006D           006D                6D 00               6D
    e          U+0065           0065                65 00               65
    n          U+006E           006E                6E 00               6E
    t          U+0074           0074                74 00               74
    i          U+0069           0069                69 00               69
    o          U+006F           006F                6F 00               6F
    n          U+006E           006E                6E 00               6E
    --------   --------------   -----------------   -----------------   --------------------
    Total: 13  14               17                  34                  27

PHP strings are natively sequences of bytes.

strlen() counts the number of bytes in a string: 27.

mb_strlen(..., 'utf-8') counts the number of code points (Unicode characters) in a string when its bytes are decoded into characters using the UTF-8 encoding: 14.

(The other examples are largely meaningless, because they process the input string as one encoding when it actually contains data in a different encoding.)
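To make the difference between the byte count and the code point count visible, here is a minimal sketch that dumps both for the string from the question. It assumes PHP 7.4 or later with the mbstring extension (for mb_str_split, mb_ord and arrow functions); the variable name $codePoints is only for illustration.

```php
<?php
// The question's string, written with escapes so the source file encoding does not matter.
$str = "\u{1F44D}\u{1F3FF}\u{270C}\u{1F3FF}\u{FE0F} @mention";

// Split into code points (decoding the bytes as UTF-8) and format each as U+XXXX.
$codePoints = array_map(
    fn (string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
    mb_str_split($str, 1, 'UTF-8')
);

// strlen() sees raw bytes; bin2hex() shows exactly which bytes those are.
echo strlen($str), ' bytes: ', bin2hex($str), PHP_EOL;              // 27 bytes
// count($codePoints) matches mb_strlen($str, 'UTF-8').
echo count($codePoints), ' code points: ', implode(' ', $codePoints), PHP_EOL; // 14 code points
```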

NSString lengths are counted in UTF-16 code units. There are 17, not 14, because the string contains characters such as 👍 that do not fit into a single UTF-16 code unit, so they have to be encoded as a surrogate pair. PHP has no function that counts a string in UTF-16 code units, but since each code unit is encoded as two bytes, you can easily work it out by encoding to UTF-16 and dividing the number of bytes by two:

 strlen(iconv('utf-8', 'utf-16le', $str)) / 2 

(Note: the le suffix is needed to make iconv encode to a specific endianness of UTF-16, rather than picking one itself and adding a BOM to the start of the string to say which one it chose, which would throw off the count.)
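Since the goal in the question is the start and end position of "@mention", the same trick can be applied to the part of the string before the match. The following is only a sketch under assumptions: the helper names utf16_length and utf16_range are hypothetical, the input is assumed to be valid UTF-8, and PHP 7.1 or later with the iconv extension is assumed.

```php
<?php
// Hypothetical helper: length of a UTF-8 string in UTF-16 code units.
function utf16_length(string $utf8): int
{
    // UTF-16LE: no BOM is prepended, and every code unit is exactly two bytes.
    // Assumes valid UTF-8; iconv() returns false on invalid input.
    return intdiv(strlen(iconv('UTF-8', 'UTF-16LE', $utf8)), 2);
}

// Hypothetical helper: [start, end) of $needle in UTF-16 code units, or null if absent.
function utf16_range(string $haystack, string $needle): ?array
{
    // strpos() returns a byte offset; in valid UTF-8 a match always
    // starts on a character boundary, so cutting there is safe.
    $bytePos = strpos($haystack, $needle);
    if ($bytePos === false) {
        return null;
    }
    // Measure everything before the match in UTF-16 code units.
    $start = utf16_length(substr($haystack, 0, $bytePos));
    return [$start, $start + utf16_length($needle)];
}

$str = "\u{1F44D}\u{1F3FF}\u{270C}\u{1F3FF}\u{FE0F} @mention";
echo utf16_length($str), PHP_EOL;                          // 17
echo implode(', ', utf16_range($str, '@mention')), PHP_EOL; // 9, 17
```

The end offset here is exclusive; subtract one if the client library expects an inclusive end position.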


I included a picture to help illustrate the answer @bobince gave.

Essentially, every code point that does not need a surrogate pair ends up as two bytes in UTF-16, while every code point encoded as a surrogate pair ends up as four bytes. Dividing the total byte count by two gives the expected length value.

PS Please forgive the error in the image: where it says "code points", it should say "code units".

[Image: unicode breakdown]
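The two-bytes-versus-four-bytes observation can also be checked without converting the string at all: count one unit for each code point up to U+FFFF and two for each code point above it. This is a rough sketch, not the method used in the answers above; the function name utf16_units_by_code_point is hypothetical, and PHP 7.4 or later with the mbstring and iconv extensions is assumed.

```php
<?php
// Hypothetical helper: count UTF-16 code units by walking the code points.
function utf16_units_by_code_point(string $utf8): int
{
    $units = 0;
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        // Code points above U+FFFF need a surrogate pair (two code units, four bytes);
        // everything else fits in a single code unit (two bytes).
        $units += mb_ord($char, 'UTF-8') > 0xFFFF ? 2 : 1;
    }
    return $units;
}

$str = "\u{1F44D}\u{1F3FF}\u{270C}\u{1F3FF}\u{FE0F} @mention";

// Both approaches should agree for valid UTF-8 input.
$byCodePoint = utf16_units_by_code_point($str);
$byBytes = intdiv(strlen(iconv('UTF-8', 'UTF-16LE', $str)), 2);
echo $byCodePoint, ' ', $byBytes, PHP_EOL; // 17 17
```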


Source: https://habr.com/ru/post/988405/

