What is XML encoding?

What is XML encoding? The most common encoding is UTF-8. How does it differ from other encodings? What is the purpose of using it?

+6
4 answers

A character encoding determines how characters are mapped to bytes. Since XML documents are stored and transmitted as byte streams, this is necessary to represent the Unicode characters that make up the XML document.

UTF-8 is the default choice because it has several advantages:

  • it is compatible with ASCII in that all valid ASCII text is also valid in UTF-8 encoding (but not necessarily the other way around).
  • it uses only 1 byte per character for "regular" letters (those that also exist in ASCII)
  • it can represent all existing Unicode characters
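
The three properties above can be checked directly. This is a minimal sketch using only Python's built-in `str.encode`; the sample characters and the helper name `utf8_length` are illustrative choices, not part of the original answer.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# 1. ASCII text produces identical bytes in ASCII and in UTF-8.
ascii_text = "<note>Hello</note>"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# 2. ASCII characters take 1 byte; other characters take 2 to 4 bytes.
widths = {ch: utf8_length(ch) for ch in ["A", "é", "€", "😀"]}

# 3. The reverse direction fails: UTF-8 bytes for a non-ASCII
#    character are not valid ASCII.
try:
    "é".encode("utf-8").decode("ascii")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False

print(same_bytes)   # True
print(widths)       # {'A': 1, 'é': 2, '€': 3, '😀': 4}
print(reverse_ok)   # False
```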

Character encodings are a more general topic than just XML. UTF-8 is not limited to use only in XML.

"What every programmer absolutely, positively needs to know about encodings and character sets to work with text" is an article that gives a good overview of the topic.

+8

When computers were first created, they mainly worked only with characters found in English, which led to the 7-bit US-ASCII standard.

However, there are many different written languages in the world, and ways had to be found to use them in computers.

The first method works fine if you restrict yourself to a particular language: it uses a culture-specific encoding such as ISO-8859-1, which can represent Western European Latin characters in 8 bits, or GB2312 for Chinese characters.

The second method is a bit more complicated, but in theory allows you to represent every character in the world: this is the Unicode standard, in which each character from each language has a specific code. However, given the large number of existing characters (109,000 in Unicode 5), a Unicode character is typically described using a three-byte representation (one byte for the Unicode plane and two bytes for the character code within that plane).
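
As a rough illustration of the plane-plus-code view described above: Unicode code points run from U+0000 to U+10FFFF, so the bits above the low 16 select one of 17 planes and the low 16 bits give the position inside that plane. The helper name `plane_and_offset` below is hypothetical, and the sketch only inspects code points; it is not how any particular encoding stores them.

```python
from typing import Tuple


def plane_and_offset(ch: str) -> Tuple[int, int]:
    """Split a character's code point into (plane, 16-bit offset in plane)."""
    code_point = ord(ch)
    return code_point >> 16, code_point & 0xFFFF


for ch in ["A", "中", "😀"]:
    plane, offset = plane_and_offset(ch)
    print(f"U+{ord(ch):04X}: plane {plane}, offset 0x{offset:04X}")
# U+0041: plane 0, offset 0x0041
# U+4E2D: plane 0, offset 0x4E2D
# U+1F600: plane 1, offset 0xF600
```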

To maximize compatibility with existing code (some of which still uses ASCII text), the standard UTF-8 encoding was designed as a way to store Unicode characters using minimal space, as described in Joachim Sauer's answer.

Thus, files encoded using specific encodings such as ISO-8859-1 are usually meant to be edited or read only by software (and people) that understand only those languages, whereas UTF-8 is meant to be highly interoperable and culturally independent. The current trend is that UTF-8 is replacing other encodings, even though this requires work from software developers, since UTF-8 strings are more difficult to process than fixed-width encoding strings.

+4

XML documents may contain non-ASCII characters, such as Norwegian æ ø å or French ê è é. Thus, to avoid errors, you set the encoding or save the XML file as Unicode.
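
A small sketch of what "setting the encoding" looks like in practice, assuming Python's standard `xml.etree.ElementTree` parser. The element name `word` and the sample text are made up for illustration. The same characters are stored as different bytes in each file, and the `encoding` attribute of the XML declaration tells the parser which interpretation to use.

```python
import xml.etree.ElementTree as ET

text = "æøå"  # sample Norwegian letters, chosen for illustration

# The same document saved in two encodings, each with a matching declaration.
doc_latin1 = (
    f'<?xml version="1.0" encoding="ISO-8859-1"?><word>{text}</word>'
).encode("iso-8859-1")
doc_utf8 = (
    f'<?xml version="1.0" encoding="UTF-8"?><word>{text}</word>'
).encode("utf-8")

# The byte streams differ: 1 byte per letter in ISO-8859-1, 2 in UTF-8.
print(len(doc_utf8) - len(doc_latin1))

# The parser reads the declaration and recovers the same characters from both.
word_latin1 = ET.fromstring(doc_latin1).text
word_utf8 = ET.fromstring(doc_utf8).text
print(word_latin1 == word_utf8 == text)  # True
```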

XML Encoding Rules

+2

When data is stored or transmitted, it is only bytes. These bytes require some interpretation. Users of non-English languages had problems with characters that existed only in their locale: those characters were not displayed correctly.

Because an XML document carries information on how to interpret its bytes as characters, it can be displayed correctly.
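
To make the "bytes require interpretation" point concrete, here is a minimal sketch in Python. The sample character and the helper name `interpret` are illustrative assumptions. The same two bytes yield the intended character under one encoding and garbage under another.

```python
def interpret(data: bytes, encoding: str) -> str:
    """Decode raw bytes under the given character encoding."""
    return data.decode(encoding)


stored = "é".encode("utf-8")  # the bytes actually written: b'\xc3\xa9'

right = interpret(stored, "utf-8")       # the encoding the writer used
wrong = interpret(stored, "iso-8859-1")  # a reader guessing incorrectly

print(stored)  # b'\xc3\xa9'
print(right)   # é
print(wrong)   # Ã©
```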

+1

Source: https://habr.com/ru/post/885896/

