Strange UTF8 string comparison

Question

I have a problem with UTF8 string matching that I can't figure out, and it is starting to give me a headache. Please help me.
I basically have this line from an XML document encoded in UTF8: "Mina Tidigare anställningar"
When I compare this line with exactly the same line that I typed myself, "Mina Tidigare anställningar" (also in UTF8), the result is FALSE!
I have no idea why. This is so strange. Can anybody help me?

+3

string xml php utf-8

James Sep 03 '10 at 14:08

3 answers

: , UTF-8 ( ). - UTF8, - .

+2

kriss Sep 03 '10 at 14:15

mb_detect_encoding($s, "UTF-8") == "UTF-8" ?: $s = utf8_encode($s);

0

DmitryK Sep 03 '10 at 14:18

Piskvor · Accepted Answer · 2010-09-03T14:17:40+0000

This looks related to Unicode normalization. To simplify, there are several ways to write the same text in Unicode (and therefore in UTF8): for example, ř can be written as the single character ř or as two characters: r and a combining ˇ.

Your best bet is the Normalizer class: normalize both strings to the same normalization form and compare the results.
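
A minimal sketch of that idea, assuming the intl extension (which provides Normalizer) is installed. The variable names and sample strings are only illustrative; the two spellings of "ä" are written as explicit bytes so the difference does not depend on your editor:

```php
<?php
// Requires the intl extension, which provides the Normalizer class.

// "ä" as one precomposed code point: U+00E4, bytes c3 a4.
$precomposed = "anst\xC3\xA4llningar";
// "ä" as "a" (U+0061) plus U+0308 COMBINING DIAERESIS, bytes 61 cc 88.
$decomposed = "ansta\xCC\x88llningar";

// A plain comparison is byte-wise, so the two spellings differ.
var_dump($precomposed === $decomposed); // bool(false)

// Normalize both sides to the same form (NFC here), then compare.
$a = Normalizer::normalize($precomposed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b); // bool(true)
```

Which form you pick matters less than using the same one on both sides; NFC is chosen here only because it keeps the strings shorter.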

In one of the comments, you show these hexadecimal representations of strings:

4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172  // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
        ^^-----------------^^^^1         ^^^^^^2

Note the parts I marked; there appear to be two parts to this problem.

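First, "c2a0" is a non-breaking space (U+00A0), while the XML has a plain space ("20") in that position. The two look the same on screen, but to PHP they are different bytes, so the comparison fails. Replacing the non-breaking space with a regular space before comparing avoids this.
Second, c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS": one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A": one character, one byte) and cc88 is the combining umlaut ¨ (U+0308 "COMBINING DIAERESIS": one character, two bytes). Normalizing both strings to the same form should make these compare equal.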
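
Both parts can be handled in one step. The following is only a sketch, assuming the intl extension and PHP 7 or later; `canonicalize()` is a hypothetical helper name, not a built-in, and the two inputs are rebuilt from the hex dump above with `hex2bin()`:

```php
<?php
// Requires the intl extension for Normalizer.

// The two byte sequences from the hex dump, with the spacing removed.
$fromXml = hex2bin('4d696e6120546964696761726520616e7374c3a46c6c6e696e676172');
$typed   = hex2bin('4d696e61205469646967617265c2a0616e737461cc886c6c6e696e676172');

// Hypothetical helper: make visually identical strings byte-identical.
function canonicalize(string $s): string
{
    // Part 1: replace NO-BREAK SPACE (U+00A0, bytes c2 a0) with a plain space.
    $s = str_replace("\xC2\xA0", ' ', $s);
    // Part 2: compose "a" + U+0308 into the single code point U+00E4 (NFC).
    $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
    // normalize() returns false on invalid input; keep the string as-is then.
    return $normalized === false ? $s : $normalized;
}

var_dump($fromXml === $typed);                             // bool(false)
var_dump(canonicalize($fromXml) === canonicalize($typed)); // bool(true)

// bin2hex() is a quick way to check what the bytes really are.
echo bin2hex(canonicalize($typed)), "\n";
```

Running `bin2hex()` on any two strings that look equal but compare as different is usually the fastest way to spot this kind of mismatch.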
