C Windows / Linux String Encoding

Question

If I take the length of a string containing a character outside the 7-bit ASCII table, I get different results on Windows and Linux:

Windows: strlen("ö") = 1
Linux: strlen("ö") = 2

On the Windows machine, the string is evidently encoded in an extended ASCII format as the single byte 0xF6, while on the Linux machine it is encoded in UTF-8 as the two bytes 0xC3 0xB6, which gives a length of 2, since strlen counts bytes.

Question:

Why is a C string encoded differently on a computer running Windows than on one running Linux?

The question arose in a discussion I had with another forum member in a code review (see this topic).

c c-strings

Frode akselsen Dec 24 '16 at 1:39

1 answer

chux · Accepted Answer · 2016-12-24T03:36:57+0000

Why is a C string encoded differently on a computer running Windows than on one running Linux?

First off, this is not a Windows / Linux (operating system) issue but a compiler one, as there are compilers on Windows that encode the way gcc (common on Linux) does.

This is allowed by C, and the two compiler makers have made different implementation choices according to their own goals, MS using CP-1252 and gcc on Linux using Unicode (UTF-8). As @Danh points out, MS's choice pre-dates Unicode. It is not surprising that different compiler makers use different solutions.

5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined. C11dr §5.2.1 1 (My emphasis)

strlen("ö") = 1
strlen("ö") = 2

"ö" is encoded according to the extended characters of the compiler's source character set.

I suspect that MS is focused on maintaining its existing code base and encourages other languages. Linux was simply an earlier adopter of Unicode in C, even though MS was an early influencer of Unicode.

As Unicode support grows, I expect it to be the solution of the future.
