C Windows / Linux String Encoding

If I take the length of a string containing a character outside the 7-bit ASCII table, I get different results on Windows and Linux:

Windows: strlen("ö") = 1
Linux: strlen("ö") = 2

On the Windows machine, the string is apparently encoded in an extended ASCII code page as 0xF6, while on the Linux machine it is encoded in UTF-8 as 0xC3 0xB6, which gives a length of 2 bytes.
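The two results can be reproduced on any one machine by spelling the bytes out with hex escapes instead of typing the character. This is only a minimal sketch: the names `OE_CP1252`, `OE_UTF8` and `dump_bytes` are made up for illustration and do not come from the original code review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* "ö" (U+00F6) written as raw bytes, so the result does not depend on
   how the compiler decodes this source file. Names are hypothetical. */
const char OE_CP1252[] = "\xF6";     /* Windows-1252: one byte */
const char OE_UTF8[]   = "\xC3\xB6"; /* UTF-8: two bytes       */

/* Compile-time check: each array holds its bytes plus the terminating NUL. */
static_assert(sizeof OE_CP1252 == 2, "one byte plus NUL");
static_assert(sizeof OE_UTF8 == 3, "two bytes plus NUL");

/* Print every byte of s in hex, followed by what strlen() reports. */
void dump_bytes(const char *label, const char *s)
{
    size_t n = strlen(s); /* counts bytes up to the NUL, not characters */
    printf("%-8s", label);
    for (size_t i = 0; i < n; i++)
        printf(" 0x%02X", (unsigned)(unsigned char)s[i]);
    printf("  strlen = %zu\n", n);
}
```

Calling `dump_bytes("CP-1252", OE_CP1252)` and `dump_bytes("UTF-8", OE_UTF8)` from `main` shows that `strlen` reports the number of bytes before the terminating NUL, so the same character yields 1 or 2 depending only on which encoding ended up in the array.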

Question:

Why is a C string encoded differently on a Windows machine and a Linux machine?


The question arose in a discussion I had with another forum member in a code review (see this topic).

1 answer

Why is a C string encoded differently on a Windows machine and a Linux machine?

First, this is not a Windows / Linux (operating system) issue but a compiler one, since there are compilers on Windows that encode the way gcc (common on Linux) does.

This is allowed by C, and the two compiler vendors have made different implementation choices according to their own goals, MS using CP-1252 and Linux using Unicode (@Danh). MS's choice predates Unicode. It is not surprising that different compiler makers employ different solutions.

5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined. C11dr §5.2.1 1 (My emphasis)

 strlen("ö") = 1
 strlen("ö") = 2

"ö" is encoded per the extended characters of the compiler's source character set.

I suspect that MS is focused on maintaining its existing code base and on encouraging other languages. Linux is simply an earlier adopter of Unicode in C, even though MS was an early influencer of Unicode.

As Unicode support grows, I expect it to be the solution of the future.
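For code that needs UTF-8 today regardless of the compiler's default, one possible approach (a sketch assuming a C11 or later compiler) is a `u8` string literal combined with a universal character name, so that neither the source nor the execution character set affects the stored bytes. The names `OE_PORTABLE` and `utf8_length` are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* u8 prefix + universal character name: the stored bytes are the UTF-8
   encoding of U+00F6 (0xC3 0xB6) whatever the compiler's default
   character sets are. Assumes C11 or later. */
const char OE_PORTABLE[] = u8"\u00F6";

static_assert(sizeof OE_PORTABLE == 3, "two UTF-8 bytes plus NUL");

/* Count code points in a valid UTF-8 string by skipping continuation
   bytes, which always have the bit pattern 10xxxxxx. */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++; /* lead byte or plain ASCII byte */
    }
    return count;
}
```

With this, `strlen(OE_PORTABLE)` is 2 on either platform, while `utf8_length(OE_PORTABLE)` is 1. Compilers also expose switches for the same purpose, such as GCC's `-fexec-charset=` and MSVC's `/utf-8`; which one is appropriate depends on the toolchain in use.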


Source: https://habr.com/ru/post/1013487/
