The width specifier in printf does not work properly with accented characters

I am trying to format some output lines in C using printf and its width specifier, but I am not getting the behavior I want. It seems that every time printf encounters one of the characters å, ä or ö, the width reserved for the field shrinks by one position.

Code snippet for illustration:

    #include <stdio.h>

    int main(void)
    {
        printf(">%-10s<\n", "aoa");
        printf(">%-10s<\n", "aäoa");
        printf(">%-10s<\n", "aäoöa");
        printf(">%-10s<\n", "aäoöaå");
        return 0;
    }

This is the output in my bash shell on Ubuntu Linux:

    >aoa       <
    >aäoa     <
    >aäoöa   <
    >aäoöaå <

I am looking for advice on how to deal with this. I want every line in the above snippet to be printed in a field 10 characters wide, as follows:

    >aoa       <
    >aäoa      <
    >aäoöa     <
    >aäoöaå    <

I would also appreciate any insight into why this happens, or feedback on whether this is a non-issue under other settings.

3 answers

Why is this happening?

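printf measures the field width in bytes, not in displayed characters. In UTF-8 (the usual encoding on Ubuntu), each of å, ä and ö takes two bytes, so every such character uses up one extra position of the field. For background, take a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.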
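To make the byte-versus-character difference visible, here is a minimal sketch that dumps the bytes of a string as hex. The helper name hex_bytes is made up for illustration, and the sketch assumes the source file and the strings are encoded as UTF-8.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s as space-separated hex
 * into out (NUL-terminated whenever outsz > 0). Returns the number of
 * bytes of s that were written; stops early if out is too small. */
size_t hex_bytes(const char *s, char *out, size_t outsz)
{
    size_t used = 0, count = 0;

    if (outsz == 0)
        return 0;
    out[0] = '\0';
    for (; *s != '\0'; s++) {
        /* Each byte needs at most 3 characters (" xx") plus the NUL. */
        if (used + 4 > outsz)
            break;
        used += (size_t)snprintf(out + used, outsz - used, "%s%02x",
                                 count ? " " : "",
                                 (unsigned)(unsigned char)*s);
        count++;
    }
    return count;
}
```

For the UTF-8 string "aäoa" this yields the five bytes 61 c3 a4 6f 61: the single character ä occupies the two bytes c3 a4, and printf counts both of them against the field width.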

As an alternative to wide characters, and assuming UTF-8, you can use this function to count the bytes that do not start a new character (the UTF-8 continuation bytes), then add the result to the printf width specifier:

    #include <stdio.h>

    /* Counts UTF-8 continuation bytes (bit pattern 10xxxxxx) */
    int func(const char *str)
    {
        int len = 0;

        while (*str != '\0') {
            if ((*str & 0xc0) == 0x80) {
                len++;
            }
            str++;
        }
        return len;
    }

    int main(void)
    {
        printf(">%-*s<\n", 10 + func("aoa"), "aoa");
        printf(">%-*s<\n", 10 + func("aäoa"), "aäoa");
        printf(">%-*s<\n", 10 + func("aäoöa"), "aäoöa");
        printf(">%-*s<\n", 10 + func("aäoöaå"), "aäoöaå");
        return 0;
    }

Output:

    >aoa       <
    >aäoa      <
    >aäoöa     <
    >aäoöaå    <
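If the padding is needed in more than one place, the same idea can be wrapped in a helper that formats into a buffer. This is only a sketch: the names utf8_extra_bytes and format_left_utf8 are invented here, and it still assumes valid UTF-8 input where every character is one column wide (no combining marks, no double-width characters).

```c
#include <stdio.h>

/* Counts UTF-8 continuation bytes (10xxxxxx), i.e. bytes that do not
 * start a new character. Assumes s is valid UTF-8. */
int utf8_extra_bytes(const char *s)
{
    int extra = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) == 0x80)
            extra++;
    }
    return extra;
}

/* Hypothetical wrapper: left-justifies s in a field of `width`
 * characters and writes the result to out. Returns whatever
 * snprintf() returns (the number of bytes the full result needs). */
int format_left_utf8(char *out, size_t outsz, const char *s, int width)
{
    return snprintf(out, outsz, "%-*s", width + utf8_extra_bytes(s), s);
}
```

For "aäoa" and a width of 10 the result is 11 bytes long (5 bytes of text plus 6 spaces), which shows up as 10 columns in a UTF-8 terminal.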

Use wide character strings and wprintf:

    #include <wchar.h>   /* the C header; <cwchar> is its C++ counterpart */
    #include <locale.h>

    int main(void)
    {
        // seems to be needed for the correct output encoding
        setlocale(LC_ALL, "");

        wprintf(L">%-10ls<\n", L"aoa");
        wprintf(L">%-10ls<\n", L"aäoa");
        wprintf(L">%-10ls<\n", L"aäoöa");
        wprintf(L">%-10ls<\n", L"aäoöaå");
        return 0;
    }
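With wide strings the width is counted in wchar_t elements rather than bytes, which is why the padding comes out right here. A small sketch using swprintf shows this without touching stdout; the helper name format_field is hypothetical, and the example assumes every character fits in a single wchar_t (true for å, ä and ö on common platforms).

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: formats s left-justified in a field of 10 wide
 * characters between '>' and '<'. Returns the number of wide
 * characters written (excluding the terminator), or a negative value
 * on error, as swprintf() does. */
int format_field(wchar_t *out, size_t n, const wchar_t *s)
{
    return swprintf(out, n, L">%-10ls<", s);
}
```

For L"aäoa" the result is always 12 wide characters long: the two delimiters plus a 10-element field. Keep in mind that a stream becomes wide-oriented on its first wide output call, so printf and wprintf should not be mixed on stdout.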

Alter Mann's accepted answer is on the right track, except that you should not use a hand-rolled function to count the bytes in a multibyte string that do not encode a visible character. Instead, localize the code using setlocale(LC_ALL, "") or similar, and use strlen(str) - mbstowcs(NULL, str, 0) to count the bytes in the string that do not encode a visible character.

setlocale() is standard C (C89, C99, C11), and is also defined in POSIX.1. mbstowcs() is standard C99 and C11, and is also defined in POSIX.1. Both are also implemented in the Microsoft C libraries, so they work almost everywhere.

Consider the following sample program, which prints the strings given on the command line:

    #include <stdlib.h>
    #include <string.h>
    #include <locale.h>
    #include <stdio.h>

    /* Counts the number of (visible) characters in a string.
       Falls back to the byte count if the string cannot be decoded
       in the current locale (mbstowcs() returns (size_t)-1). */
    static size_t ms_len(const char *const ms)
    {
        size_t n;

        if (!ms)
            return 0;
        n = mbstowcs(NULL, ms, 0);
        return (n == (size_t)-1) ? strlen(ms) : n;
    }

    /* Number of bytes that do not generate a visible character in a string */
    static size_t ms_extras(const char *const ms)
    {
        if (ms)
            return strlen(ms) - ms_len(ms);
        else
            return 0;
    }

    int main(int argc, char *argv[])
    {
        int arg;

        /* Default locale */
        setlocale(LC_ALL, "");

        for (arg = 1; arg < argc; arg++)
            printf(">%-*s< (%zu bytes; %zu chars; %zu bytes extra in wide chars)\n",
                   (int)(10 + ms_extras(argv[arg])), argv[arg],
                   strlen(argv[arg]), ms_len(argv[arg]), ms_extras(argv[arg]));

        return EXIT_SUCCESS;
    }

If you compile the above to example and run

 ./example aaa aaä aää äää aa€ a€€ €€€ a ä € 😈 

the program will output

    >aaa       < (3 bytes; 3 chars; 0 bytes extra in wide chars)
    >aaä       < (4 bytes; 3 chars; 1 bytes extra in wide chars)
    >aää       < (5 bytes; 3 chars; 2 bytes extra in wide chars)
    >äää       < (6 bytes; 3 chars; 3 bytes extra in wide chars)
    >aa€       < (5 bytes; 3 chars; 2 bytes extra in wide chars)
    >a€€       < (7 bytes; 3 chars; 4 bytes extra in wide chars)
    >€€€       < (9 bytes; 3 chars; 6 bytes extra in wide chars)
    >a         < (1 bytes; 1 chars; 0 bytes extra in wide chars)
    >ä         < (2 bytes; 1 chars; 1 bytes extra in wide chars)
    >€         < (3 bytes; 1 chars; 2 bytes extra in wide chars)
    >😈         < (4 bytes; 1 chars; 3 bytes extra in wide chars)

If the last < does not line up with the others, it is because the font used is not exactly fixed-width: the emoticon 😈 is wider than ordinary characters such as Ä, that's all. Blame the font.

The last character is U+1F608 SMILING FACE WITH HORNS, from the Emoticons Unicode block, in case your OS / browser / font cannot display it. On Linux, all of the above > and < line up correctly in all the terminals I have, including the console (the non-graphical system console), although the console font has no glyph for the emoticon and just displays it as a diamond.

Unlike Alter Mann's answer, this approach is portable and makes no assumptions about which character set the current user is actually using.


Source: https://habr.com/ru/post/1243107/

