What causes char to be signed or unsigned when using gcc?

What determines whether a char in C (using gcc) is signed or unsigned? I know that the standard does not mandate one over the other, and that I can check CHAR_MIN and CHAR_MAX from limits.h for the limits, but I want to know what triggers one or the other when using gcc.

If I read limits.h from libgcc-6, I see that there is a __CHAR_UNSIGNED__ macro that makes the "default" char signed or unsigned, but I'm not sure whether this is set by the compiler at (its own) build time.

I tried to list the predefined GCC macros with

    $ gcc -dM -E -xc /dev/null | grep -i CHAR
    #define __UINT_LEAST8_TYPE__ unsigned char
    #define __CHAR_BIT__ 8
    #define __WCHAR_MAX__ 0x7fffffff
    #define __GCC_ATOMIC_CHAR_LOCK_FREE 2
    #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2
    #define __SCHAR_MAX__ 0x7f
    #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1)
    #define __UINT8_TYPE__ unsigned char
    #define __INT8_TYPE__ signed char
    #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2
    #define __CHAR16_TYPE__ short unsigned int
    #define __INT_LEAST8_TYPE__ signed char
    #define __WCHAR_TYPE__ int
    #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2
    #define __SIZEOF_WCHAR_T__ 4
    #define __INT_FAST8_TYPE__ signed char
    #define __CHAR32_TYPE__ unsigned int
    #define __UINT_FAST8_TYPE__ unsigned char

but could not find __CHAR_UNSIGNED__
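
For completeness, here is a minimal test program (just a sketch) that reports the choice from inside a translation unit, using only __CHAR_UNSIGNED__ and limits.h:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
    #ifdef __CHAR_UNSIGNED__
        puts("__CHAR_UNSIGNED__ is defined");
    #else
        puts("__CHAR_UNSIGNED__ is not defined");
    #endif
        /* CHAR_MIN is 0 exactly when plain char is unsigned */
        printf("plain char is %s\n", CHAR_MIN == 0 ? "unsigned" : "signed");
        return 0;
    }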

Background: I have code that I compile on two different machines:

Desktop PC:

  • Debian GNU/Linux 9.1 (stretch)
  • gcc version 6.3.0 20170516 (Debian 6.3.0-18)
  • Intel(R) Core(TM) i3-4150
  • libgcc-6-dev: 6.3.0-18
  • char is signed

Raspberry Pi3 :

  • Raspbian GNU/Linux 9.1 (stretch)
  • gcc version 6.3.0 20170516 (Raspbian 6.3.0-18+rpi1)
  • ARMv7 Processor rev 4 (v7l)
  • libgcc-6-dev: 6.3.0-18+rpi
  • char is unsigned

So the only obvious difference is the processor architecture ...

+48
c gcc
Sep 28 '17 at 7:12
6 answers

According to the C11 standard (read n1570), char can be either signed or unsigned (so you actually have two flavors of C). Exactly which one is implementation-specific.

Some processors and instruction set architectures or application binary interfaces favor a signed char (byte) type (e.g. because it maps nicely to some machine instruction), others favor an unsigned one.

gcc even has -fsigned-char and -funsigned-char options which you should almost never use (because changing this breaks some corner cases in calling conventions and ABIs), unless you recompile everything, including the C standard library.

You can use feature_test_macros(7) and <endian.h> (see endian(3)) or autoconf on Linux to detect what your system has.

In most cases you should write portable C code that does not depend on these things. And you can find cross-platform libraries (e.g. glib) to help you with that.

BTW, gcc -dM -E -xc /dev/null also gives you __BYTE_ORDER__ etc., and if you want an 8-bit unsigned byte, you should use <stdint.h> and its uint8_t (more portable and more readable). The standard limits.h defines CHAR_MIN and SCHAR_MIN and CHAR_MAX and SCHAR_MAX (you can compare them for equality to detect signed chars), etc.
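
As a small illustration of that equality test (a sketch; the PLAIN_CHAR_IS_SIGNED name is made up here):

    #include <limits.h>

    /* Plain char has the same range as signed char exactly when it is signed. */
    #if CHAR_MAX == SCHAR_MAX
    #define PLAIN_CHAR_IS_SIGNED 1
    #else
    #define PLAIN_CHAR_IS_SIGNED 0
    #endif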

By the way, you should care about character encoding, but most systems today use UTF-8 everywhere. Libraries such as libunistring are useful. See also this, and remember that a Unicode character encoded in UTF-8 can span several bytes (i.e. several chars).

+52
Sep 28 '17 at 7:19

The default depends on the platform and its native code set. For example, machines that typically use EBCDIC (usually mainframes) should use unsigned char (or have CHAR_BIT > 8), since C requires characters in the basic code set to be positive, and EBCDIC uses codes like 240 for the digit 0. (C11 standard, §6.2.5 Types ¶2 says: An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.)

You can control which variant GCC uses with the -fsigned-char or -funsigned-char options. Whether that is a good idea is a separate discussion.
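
As a quick sanity check of that guarantee (a throwaway sketch, nothing more), digits stored in plain char always compare non-negative, whatever the default signedness:

    #include <assert.h>

    int main(void)
    {
        /* Members of the basic execution character set stored in a char
           are guaranteed to be non-negative (C11 6.2.5p2). */
        const char digits[] = "0123456789";
        for (int i = 0; digits[i] != '\0'; ++i)
            assert(digits[i] >= 0);
        return 0;
    }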

+41
Sep 28 '17 at 7:25

The char type will be either signed or unsigned, depending on the platform and compiler.

According to this link:

The C and C++ standards allow the character type char to be signed or unsigned, depending on the platform and compiler.

Most systems, including x86 GNU/Linux and Microsoft Windows, use signed char,

but those based on PowerPC and ARM processors typically use unsigned char.

This can lead to unexpected results when porting programs between platforms which have different defaults for the char type.

GCC provides the options -fsigned-char and -funsigned-char to set the default char type.

+11
Sep 28 '17 at 7:46

gcc has two compile-time options that control char behavior:

    -funsigned-char
    -fsigned-char

It is not recommended to use either of these options unless you know exactly what you are doing.

The default is platform-dependent and is fixed when gcc itself is built. It is chosen for the best compatibility with the other tools that exist on that platform.

Source

+6
Sep 28 '17 at 7:34

On x86-64 Linux at least, it is defined by the x86-64 System V psABI.

Other platforms will have similar ABI standards documents that define the rules which let different C compilers agree with each other on calling conventions, struct layouts, and the like. (See the x86 tag wiki for links to other x86 ABI docs, or other places for other architectures. Most architectures other than x86 have only one or two standard ABIs.)

From x86-64 SysV ABI: Figure 3.1: Scalar Types

    C type            sizeof   Alignment (bytes)   AMD64 Architecture
    ------------------------------------------------------------------
    _Bool*            1        1                   boolean
    ------------------------------------------------------------------
    char              1        1                   signed byte
    signed char
    ------------------------------------------------------------------
    unsigned char     1        1                   unsigned byte
    ------------------------------------------------------------------
    ...
    ------------------------------------------------------------------
    int               4        4                   signed fourbyte
    signed int
    enum***
    ------------------------------------------------------------------
    unsigned int      4        4                   unsigned fourbyte
    ------------------------------------------------------------------
    ...

* This type is called bool in C++.

*** C++ and some implementations of C permit enums larger than an int. The underlying enum type is bumped to an unsigned int, long int or unsigned long int, in that order.




Whether char is signed or not actually does directly affect the calling convention in this case, because of a currently undocumented requirement that clang relies on: narrow types are sign- or zero-extended to 32 bits when passed as function args, according to the callee's prototype.

So for int foo(char c) { return c; }, clang will rely on the caller having sign-extended the arg. (code + asm for this and a caller on Godbolt).

    gcc:
        movsx   eax, dil    # sign-extend low byte of first arg reg into eax
        ret

    clang:
        mov     eax, edi    # copy whole 32-bit reg
        ret



Even apart from the calling convention, C compilers for the same platform must agree, so that they compile inline functions in a .h the same way.

If (int)(char)x behaved differently in different compilers for the same platform, they would not be truly compatible.
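
To make that concrete, here is a small sketch (mine, not from the ABI document): on a typical two's-complement target it prints -1 when built with -fsigned-char and 255 with -funsigned-char, which is exactly why every compiler on a platform has to agree on the default:

    #include <stdio.h>

    /* Converting char to int preserves the value, so the output depends
       entirely on whether plain char is signed or unsigned. */
    static int to_int(char c) { return c; }

    int main(void)
    {
        char c = (char)0xFF;  /* implementation-defined result if char is signed */
        printf("%d\n", to_int(c));
        return 0;
    }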

+6
Sep 28 '17 at 14:09

An important practical note is that the type of a UTF-8 string literal, such as u8"...", is an array of char, and it must be stored in UTF-8 format. Characters in the basic set are guaranteed to be equivalent to positive integers. However,

If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

(In C++, the type of a UTF-8 string constant is const char[], and it is not specified whether characters outside the basic set have numeric representations at all.)

Therefore, if your program needs to work with the bytes of UTF-8 strings, you will need to use unsigned char. Otherwise, any code that checks whether the bytes of a UTF-8 string are in a specific range will not be portable.

It is better to explicitly use unsigned char* than to write char and expect the programmer to compile with the right settings so that it is configured as unsigned char. However, you can use static_assert() to check whether the range of char includes all the numbers from 0 to 255.
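
A short sketch of that advice (the helper name is invented here): count UTF-8 continuation bytes by converting each byte to unsigned char before the range check, so the comparison works no matter how plain char is configured:

    #include <stddef.h>

    /* Count UTF-8 continuation bytes (those of the form 10xxxxxx). */
    static size_t count_continuation_bytes(const char *s)
    {
        size_t n = 0;
        for (; *s != '\0'; ++s) {
            unsigned char b = (unsigned char)*s;  /* portable byte value 0..255 */
            if (b >= 0x80 && b <= 0xBF)
                ++n;
        }
        return n;
    }

For example, count_continuation_bytes(u8"\u00e9") returns 1, since é encodes as the two bytes 0xC3 0xA9 and only the second is a continuation byte.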

+1
Sep 28 '17 at 22:35


