Printf field width: bytes or characters?

The printf / fprintf / sprintf family supports a width field in its format specifier. I have a doubt about the case of (non-wide) char array arguments:

Does the width field mean bytes or characters?

What is the (de facto correct) behavior if the char array holds (say) a raw UTF-8 string? (I know that normally I should use some wide char type; that is not the point.)

For example, in

char s[] = "ni\xc3\xb1o";  // utf8 encoded "niño"
fprintf(f,"%5s",s);

Is the function supposed to try to output just 5 bytes (plain C chars), leaving you responsible for misalignment or other problems when two bytes make up one textual character?

Or is it supposed to compute the length of the array in "textual characters" (decoding it according to the current locale)? In the example, that would mean finding that the string has 4 Unicode characters, so it would add a space for padding.

UPDATE: I agree with the answers; it is logical that the printf family does not distinguish plain C chars from bytes. The problem is that my glibc does not seem to fully respect this notion if the locale has been set beforehand and one has the (most common today) LANG / LC_CTYPE = en_US.UTF-8

Example:

#include <stdio.h>
#include <locale.h>

int main(void) {
        setlocale(LC_ALL, ""); /* I have LC_CTYPE="en_US.UTF-8" */
        char s[] = {'n','i', 0xc3,0xb1,'o',0}; /* "niño" in utf8: 5 bytes, 4 unicode chars */
        printf("|%*s|\n",6,s); /* this should pad a blank - works ok */
        printf("|%.*s|\n",4,s); /* this should eat a char - works ok */
        char s3[] = {'A',0xb1,'B',0}; /* this is not valid UTF8 */
        printf("|%s|\n",s3);     /* print raw chars - ok */
        printf("|%.*s|\n",15,s3);     /* panics (why???) */
        return 0;
}

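So, even when a non-POSIX-C locale has been set, printf still seems to have the right notion for counting width: bytes (plain C chars), not Unicode characters. That is fine. However, when given a char array that cannot be decoded in its locale, it silently panics (it aborts: nothing is printed after the first "|", and there is no error message)... but only if it needs to count some width. I don't understand why it even tries to decode the string from utf-8 when it doesn't need to. Is this a bug in glibc?

Tested with glibc 2.11.1 (Fedora 12) (also glibc 2.3.6)

Note: this is not a terminal display issue. I check the output by piping it to od: $ ./a.out | od -t cx1 gives: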

0000000   |       n   i 303 261   o   |  \n   |   n   i 303 261   |  \n
         7c  20  6e  69  c3  b1  6f  7c  0a  7c  6e  69  c3  b1  7c  0a
0000020   |   A 261   B   |  \n   |
         7c  41  b1  42  7c  0a  7c

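UPDATE 2 (2015): This questionable behavior has been fixed in newer versions of glibc (from 2.17, it seems). With glibc-2.17-21.fc19 it works fine for me.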

+3
6

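It will result in five bytes being output. And five chars. In ISO C there is no distinction between chars and bytes. Bytes are not necessarily 8 bits wide; a byte is defined as the width of a char.

The ISO term for an 8-bit value is an octet.

Your "niño" string is actually five characters wide as far as the C environment is concerned (not counting the null terminator, of course). If only four symbols show up on your terminal, that is almost certainly a function of the terminal, not of C's output functions.

I'm not saying a C implementation could not handle Unicode. It could quite easily do UTF-32 if CHAR_BIT were defined as 32. UTF-8 would be harder since it is a variable-length encoding, but there are ways around almost any problem :-)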


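Based on your update, it seems you might have a problem. However, I am not seeing the behavior you describe on my setup: I get the same output from those last two printf statements.

If your setup is just stopping output after the first |, I would raise the issue with GNU.

For reference, here is the od output on my system: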

pax> ./qq | od -t cx1
0000000   |       n   i 303 261   o   |  \n   |   n   i 303 261   |  \n
         7c  20  6e  69  c3  b1  6f  7c  0a  7c  6e  69  c3  b1  7c  0a
0000020   |   A 261   B   |  \n   |   A 261   B   |  \n
         7c  41  b1  42  7c  0a  7c  41  b1  42  7c  0a
0000034

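So the output stream contains the UTF-8, which means it is the terminal program that must interpret it. C/glibc is not modifying the output at all, so perhaps I misunderstood what you were trying to say.

Although, if your od output also shows only the opening bar on that line (unlike mine, which does not seem to have the problem), then something is wrong inside C/glibc rather than the terminal silently dropping characters (I would expect a terminal to drop either the whole line or just the offending character (i.e. output |A); the fact that you get only | seems to rule out a terminal problem). Please clarify.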

+3

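Bytes (chars). There is no built-in support for Unicode semantics. You can think of it as resulting in at least 5 calls to fputc.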

+2

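The original question (bytes or chars?) was answered correctly by several people: according to both the spec and the glibc implementation, the width (or precision) in the printf C function counts bytes (or, more correctly, plain C chars, which are the same thing). So fprintf(f,"%5s",s) means "try to output at least 5 bytes (plain chars) from the array s -if there are not enough, pad with blanks".

It does not matter whether the string (in my example, of byte length 5) represents text encoded in -say- UTF8 and contains 4 "textual (unicode) characters". To printf(), internally, it just has 5 (plain) C chars, and that is what counts.

That much seems clear. But it does not explain my other problem, so something is still missing.

Searching the glibc bug tracker, I found some related (rather old) issues; I was not the first one caught by this...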
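
The byte-counting rule can be checked directly. The following is a minimal sketch, assuming the default "C" locale (no setlocale() call); `format_padded` and `format_truncated` are hypothetical wrapper names introduced only for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical wrapper: formats s with "%*s" and returns the byte count
   reported by snprintf. Runs in the default "C" locale because
   setlocale() is never called. */
int format_padded(char *out, size_t outsz, int width, const char *s)
{
    return snprintf(out, outsz, "%*s", width, s);
}

/* Hypothetical wrapper: formats s with "%.*s", i.e. at most prec bytes. */
int format_truncated(char *out, size_t outsz, int prec, const char *s)
{
    return snprintf(out, outsz, "%.*s", prec, s);
}
```

With s = "ni\xc3\xb1o" (5 bytes), a width of 5 adds no padding, a width of 6 adds one leading space, and a precision of 4 keeps the first four bytes, which happen to end on a character boundary here.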

http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308

http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

This quote, from the last link, is especially relevant:

ISO C99 requires for %.*s to only write complete characters that fit below the
precision number of bytes.  If you are using say UTF-8 locale, but ISO-8859-1
characters as shown in the input file you provided, some of the strings are
not valid UTF-8 strings, therefore sprintf fails with -1 because of the
encoding error. That's not a bug in glibc.

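Whether this is a bug or not (perhaps in the interpretation, or in the ISO spec itself) is debatable. But what glibc is doing, and why, can now be explained.

Recall my problematic statement: printf("|%.*s|\n",15,s3). Here glibc must find out whether the length of s3 is greater than 15 and, if so, truncate it. Computing that length does not require touching encodings at all. But if the string has to be truncated, glibc tries to be careful: if it just kept the first 15 bytes, it could split a multibyte character in half and produce invalid text output (I would prefer that, but glibc sticks to its curious interpretation of ISO C99). So it unfortunately needs to decode the char array, using the environment locale, to find the real character boundaries. Hence, for example, if LC_CTYPE says UTF-8 and the array is not a valid UTF-8 byte sequence, it aborts (not so bad, because printf then returns -1; not so good, because it prints part of the string anyway, so it is hard to recover cleanly).

Apparently it is only in this case, when a precision is specified for a string and truncation is possible, that glibc needs to mix some Unicode semantics with the plain chars/bytes semantics. Quite ugly, IMO.

UPDATE. Note that this behavior matters not only when the original encoding is invalid, but also when the truncation itself would produce an invalid sequence. For example: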

char s[] = "ni\xc3\xb1o";  /* "niño" in UTF8: 5 bytes, 4 unicode chars */
printf("|%.3s|",s); /* would cut the double-byte UTF8 char in two */

This truncates the field to 2 bytes, not 3, because glibc refuses to output an invalid UTF8 string:

$ ./a.out
|ni|
$ ./a.out | od -t cx1
0000000   |   n   i   |  \n
        7c 6e 69 7c 0a
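
Because this truncation behavior differs between glibc versions, one way to get a predictable result is to pick the cut point yourself and pass it as the precision. The sketch below assumes the input is valid UTF-8; `utf8_safe_prefix_len` is a hypothetical helper written for this example, not a library function.

```c
#include <stddef.h>

/* Hypothetical helper (not part of any library): returns the largest byte
   count <= max_bytes that does not cut a UTF-8 sequence in half.
   Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_safe_prefix_len(const char *s, size_t max_bytes)
{
    size_t n = 0;

    /* Advance up to max_bytes or to the end of the string. */
    while (n < max_bytes && s[n] != '\0')
        n++;

    /* If the byte just past the cut is a continuation byte (10xxxxxx),
       the cut falls inside a sequence: back up to exclude its lead byte. */
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;

    return n;
}
```

A call such as printf("|%.*s|", (int)utf8_safe_prefix_len(s, 3), s) then asks for 2 bytes of "ni\xc3\xb1o", so the result no longer depends on how the library treats a split character.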

UPDATE (2015): This (IMO) questionable behavior has been changed (fixed) in newer versions of glibc. See the main question.

+1

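To be portable, convert the string using mbstowcs and print it using printf( "%6ls", wchar_ptr ).

%ls is the specifier for a wide string according to POSIX.

There is no de facto standard. Typically, I would expect stdout to accept UTF-8 if the OS and locale have been configured to treat it as UTF-8, but I would expect printf to be ignorant of multibyte encoding, because it is not defined in those terms.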
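
If padding by characters is wanted while staying with plain char arrays, one option (not taken from any of the answers here) is to count code points yourself and widen the byte width accordingly. This is a minimal sketch assuming valid UTF-8 and a non-negative width; both function names are hypothetical, and code points are not the same as display columns (combining marks and wide characters would still misalign).

```c
#include <stdio.h>
#include <string.h>

/* Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
   Assumes s is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Hypothetical helper: right-aligns s in a field of `width` code points by
   adding the surplus bytes (bytes minus code points) to the byte width that
   printf actually uses. Assumes width >= 0. */
int format_padded_utf8(char *out, size_t outsz, int width, const char *s)
{
    size_t bytes = strlen(s);
    size_t chars = utf8_codepoints(s);
    int byte_width = width + (int)(bytes - chars);
    return snprintf(out, outsz, "%*s", byte_width, s);
}
```

For "ni\xc3\xb1o" and a width of 5, the byte width becomes 6, so one space of padding is added, which is the second behavior described in the question.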

0

Do not use mbstowcs unless you have also made sure that wchar_t is at least 32 bits wide. Otherwise you will most likely end up with UTF-16, which has all the flaws of UTF-8 and all the flaws of UTF-32.

I am not saying avoid mbstowcs. I'm just saying don't let Windows programmers use it.

It may be easier to use iconv to convert to UTF-32.

0

Source: https://habr.com/ru/post/1744578/

