Strange UTF8 string comparison

I am having a problem with UTF8 string matching that I cannot figure out, and it is starting to give me a headache. Please help me.
I have this line from an XML document encoded in UTF8: "Mina Tidigare anställningar"
When I compare it with exactly the same line typed by hand, "Mina Tidigare anställningar" (also in UTF8), the result is FALSE!
I have no idea why. This is so strange. Can anybody help me?

+3
3 answers

This looks like a normalization problem. To simplify: there are several ways to write the same text in Unicode (and therefore in UTF8). For example, ř can be written as one character, ř, or as two characters: r followed by a combining caron ˇ.
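
To make that concrete, here is a minimal sketch that builds both spellings of ř and prints their bytes. The variable names are only illustrative, and the `\u{...}` escapes assume PHP 7 or later.

```php
<?php
// Two spellings of the same visible character "ř".
$precomposed = "\u{0159}";   // U+0159 LATIN SMALL LETTER R WITH CARON
$decomposed  = "r\u{030C}";  // U+0072 "r" followed by U+030C COMBINING CARON

// A plain comparison works on bytes, so the two spellings differ.
var_dump($precomposed === $decomposed);  // bool(false)

echo bin2hex($precomposed), "\n";  // c599
echo bin2hex($decomposed), "\n";   // 72cc8c
```

Both strings render the same on screen, which is why the mismatch is so hard to spot without a hex dump.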

The Normalizer class is the best tool here: normalize both strings to the same normalization form and compare the results.
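
A rough sketch of that comparison is shown here. It assumes the intl extension is installed (it provides `Normalizer`), and `equalsNormalized` is a hypothetical helper name, not part of PHP. NFC is chosen arbitrarily; any form works as long as both strings use the same one.

```php
<?php
// Assumption: the intl extension is loaded, so the Normalizer class exists.

/**
 * Hypothetical helper: compare two UTF-8 strings after normalizing both to NFC.
 */
function equalsNormalized(string $a, string $b): bool
{
    $na = Normalizer::normalize($a, Normalizer::FORM_C);
    $nb = Normalizer::normalize($b, Normalizer::FORM_C);

    // normalize() returns false for invalid input; treat that as "not equal".
    if ($na === false || $nb === false) {
        return false;
    }
    return $na === $nb;
}

// Precomposed "ř" versus "r" + combining caron.
var_dump(equalsNormalized("\u{0159}", "r\u{030C}"));  // bool(true)
```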

In one of the comments, you show these hexadecimal representations of strings:

4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172  // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
        ^^-----------------^^^^1         ^^^^^^2

Note the parts I marked; there appear to be two parts to this problem.

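  • First, "c2a0" is a non-breaking space (U+00A0 "NO-BREAK SPACE"), while the XML has a regular space, "20". Normalizing to NFC or NFD does not change it, so in PHP it has to be replaced explicitly.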

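  • Second, the ä: "c3a4" is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS", a single precomposed character), whereas "61" is a (U+0061 "LATIN SMALL LETTER A", the base letter) followed by "cc88", the combining umlaut ¨ (U+0308 "COMBINING DIAERESIS", a combining mark). Normalization makes these two forms identical.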
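
Both parts can be handled together along these lines. This is only a sketch: `canonicalize` is a made-up helper name, it assumes the intl extension is available, and it assumes you actually want a non-breaking space to count as a regular space. The two strings are rebuilt from the hex dumps above.

```php
<?php
// Rebuild the two strings from the hex dumps, piece by piece.
$fromXml = hex2bin('4d696e61205469646967617265' . '20'   . '616e7374' . 'c3a4'   . '6c6c6e696e676172');
$typed   = hex2bin('4d696e61205469646967617265' . 'c2a0' . '616e7374' . '61cc88' . '6c6c6e696e676172');

/**
 * Hypothetical helper: bring a string into one comparable form.
 */
function canonicalize(string $s): string
{
    // Part 1: replace U+00A0 NO-BREAK SPACE (bytes c2 a0) with a regular space.
    $s = str_replace("\u{00A0}", ' ', $s);

    // Part 2: compose "a" + U+0308 into the single code point U+00E4 (NFC).
    $normalized = Normalizer::normalize($s, Normalizer::FORM_C);

    // normalize() returns false on invalid input; fall back to the unnormalized string.
    return $normalized === false ? $s : $normalized;
}

var_dump($fromXml === $typed);                              // bool(false)
var_dump(canonicalize($fromXml) === canonicalize($typed));  // bool(true)
```

If the non-breaking space is meaningful in your data, drop the `str_replace` step and fix the typed string at its source instead.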

+21

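My guess: one of the two strings is not actually UTF-8 (check which encoding your editor saves in). Make sure both are UTF8, then compare.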

+2

// If $s is not already valid UTF-8, convert it (utf8_encode() assumes ISO-8859-1 input).
mb_detect_encoding($s, "UTF-8", true) == "UTF-8" ?: $s = utf8_encode($s);

0

Source: https://habr.com/ru/post/1763089/

