Why can't I use Ñ in my XML output when it is declared as UTF-8?

Question

I have the character "N Tilde" (Ñ) in my DB2 z/OS database. I am generating an XML file from the data. In the XML, I have encoding=UTF-8; however, Internet Explorer gives me an "Illegal character in text field" error. If I change the encoding to ISO-8859-1, it works fine.

I thought ISO-8859-1 was a subset of UTF-8, so why doesn't it work with UTF-8?

Is UTF-8 the best choice for an XML document?

+4

java unicode utf-8 character-encoding iso-8859-1

Tim Feb 23 '11 at 15:13

4 answers

UTF-8 ≠ Unicode

Note:

ASCII is a subset of the ISO 8859-1 standard.
ASCII is a subset of Unicode.
ASCII is a subset of UTF-8.
ISO 8859-1 is a subset of Unicode.
ISO 8859-1 is not a subset of UTF-8.
Unicode is not the same as UTF-8.

I highly recommend familiarizing yourself with the intricacies of modern terminology.

If this is too confusing, you might look at Radix-50, which has a repertoire orders of magnitude smaller than Unicode's but nevertheless exhibits several of the same subtleties that now trip people up with Unicode: character repertoire, coded character sets, character encoding forms, and character encoding schemes.

Java `char`s Cannot Hold Characters

Since you came to this from Java, it really is not your fault that these concepts are not clearly separated in your mind. Java seriously muddles the issue by not separating the code points (logical characters) of a coded character set from the nitty-gritty mechanics of one particular character encoding form (UTF-16).

Java's unfortunate conflation of `char`s with logical characters is error-prone in the extreme; perhaps it would be more accurate to say that Java programmers are the ones who conflate the two. Either way, there now seems to be no hope of a cure.

Blame it all on hysterical porpoises if you like, but the most charitable thing one can say is that it is very unfortunate. Because of all this, perfectly sane and competent programmers like you will forever be easily confused, and will therefore keep writing Java code that is simple, clear, and wrong.

Education about all of this is the only possible palliative, but this is not a true cure.
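To make the `char` versus code point distinction concrete, here is a minimal sketch. The class name `CharVsCodePoint` and the helper `countCodePoints` are made up for illustration; the only APIs used are the standard `String.length`, `String.codePointCount`, and `String.codePointAt`.

```java
class CharVsCodePoint {
    // Counts logical characters (Unicode code points), not UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String nTilde = "\u00D1";      // U+00D1, fits in a single char
        String gClef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF, needs a surrogate pair

        System.out.println(nTilde.length());          // 1 char
        System.out.println(countCodePoints(nTilde));  // 1 code point

        System.out.println(gClef.length());           // 2 chars
        System.out.println(countCodePoints(gClef));   // 1 code point
        System.out.println(Integer.toHexString(gClef.codePointAt(0))); // 1d11e
    }
}
```

For Ñ the two counts agree, which is why this particular conflation is not what breaks the XML in the question; it only bites for characters outside the Basic Multilingual Plane.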

+2

tchrist Feb 23 '11 at 15:41

ISO-8859-1 is not a subset of UTF-8 at all. ASCII is a subset of both ISO-8859-1 and UTF-8. The two encodings differ specifically for characters in the Unicode code point range U+0080 to U+00FF.

In ISO-8859-1, the character "Ñ" (U+00D1 LATIN CAPITAL LETTER N WITH TILDE) is represented as the single byte D1. In UTF-8, the same character is represented by the two-byte sequence C3 91.
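The byte values above can be checked with a short Java sketch. The class name `NTildeBytes` and the `hex` helper are illustrative only; `String.getBytes(Charset)` and `StandardCharsets` are standard JDK APIs.

```java
import java.nio.charset.StandardCharsets;

class NTildeBytes {
    // Formats bytes as space-separated uppercase hex, for example "C3 91".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nTilde = "\u00D1"; // U+00D1 LATIN CAPITAL LETTER N WITH TILDE

        System.out.println(hex(nTilde.getBytes(StandardCharsets.ISO_8859_1))); // D1
        System.out.println(hex(nTilde.getBytes(StandardCharsets.UTF_8)));      // C3 91

        // An ASCII character is encoded identically in both.
        System.out.println(hex("N".getBytes(StandardCharsets.ISO_8859_1)));    // 4E
        System.out.println(hex("N".getBytes(StandardCharsets.UTF_8)));         // 4E
    }
}
```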

+1

Avi Feb 23 '11 at 15:20

The best way to create XML in Java is to use the XML library - this also ensures that everything is well-formed.
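As one possible illustration, the JDK's built-in StAX API (`javax.xml.stream`) can write the document and handle both the encoding and the escaping. This is only a sketch: the class name `StaxXmlDemo`, the method `writeName`, and the element name `name` are invented for the example, and the answer does not say which XML library it has in mind.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StaxXmlDemo {
    // Serializes <name>...</name> as UTF-8 bytes; the writer escapes markup characters.
    static byte[] writeName(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0"); // same encoding as the byte stream
            writer.writeStartElement("name");
            writer.writeCharacters(name);
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = writeName("Espa\u00D1a & co"); // contains U+00D1 and an ampersand
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

Because the same encoding name is passed to both `createXMLStreamWriter` and `writeStartDocument`, the declaration and the actual bytes cannot drift apart.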

If you must create it manually, it is best to use new OutputStreamWriter(stream, encoding), where encoding is the same encoding that you declare in the XML prolog.
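A minimal sketch of that approach follows. The class name `ManualXmlDemo`, the methods `writeXml` and `toBytes`, and the element name `name` are hypothetical. Note that the sketch does not escape `<` or `&` in the text, which is one reason a library is the safer choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ManualXmlDemo {
    // Uses ONE Charset for both the declared encoding and the actual byte encoding.
    static void writeXml(OutputStream stream, Charset encoding, String name) throws IOException {
        try (Writer writer = new OutputStreamWriter(stream, encoding)) {
            writer.write("<?xml version=\"1.0\" encoding=\"" + encoding.name() + "\"?>\n");
            writer.write("<name>" + name + "</name>\n"); // no escaping: sketch only
        }
    }

    // Convenience wrapper that returns the encoded document as bytes.
    static byte[] toBytes(Charset encoding, String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            writeXml(out, encoding, name);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = toBytes(StandardCharsets.UTF_8, "\u00D1");
        byte[] latin1 = toBytes(StandardCharsets.ISO_8859_1, "\u00D1");
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

The point of passing a single `Charset` around is that the declaration is derived from it, so the two cannot disagree.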

Also, make sure the strings you get from your database are decoded correctly.

0

Paŭlo Ebermann Feb 23 '11 at 19:23

Joachim Sauer · Accepted Answer · 2011-02-23T15:16:11+0000

ISO-8859-1 is not a subset of UTF-8. Its character repertoire is a subset of what UTF-8 can represent, but it does not encode those characters the same way.

Both ISO-8859-1 and UTF-8 are supersets of ASCII (that is, they can represent all the characters that ASCII can represent, and they represent them in the same way).

Thus, you cannot just label ISO-8859-1 data as UTF-8 and hope that it works; you need to actually store (or convert) your data as UTF-8.
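A rough Java sketch of both halves of that statement: ISO-8859-1 bytes containing Ñ are not well-formed UTF-8, and converting them means decoding with the real source encoding and then re-encoding. The class name `Latin1ToUtf8` and its two methods are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Latin1ToUtf8 {
    // Converts by decoding with the real source encoding, then re-encoding as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1) {
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Strict check: true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "<a>Ñ</a>" as ISO-8859-1 bytes: the Ñ is the lone byte 0xD1.
        byte[] latin1 = "<a>\u00D1</a>".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(latin1));               // false: merely relabeling fails
        System.out.println(isValidUtf8(latin1ToUtf8(latin1))); // true: converted data is fine
    }
}
```

A strict XML parser rejects the first case for the same reason the strict decoder does, which matches the "illegal character" error described in the question.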
