Alter Mann's accepted answer matches the correct strings, but you do not need an arbitrary function to count the bytes in a multibyte string that encode a visible character. You just need to set the locale with setlocale(LC_ALL, "") or similar, then use mbstowcs(NULL, str, 0) to count the number of (visible) characters, and strlen(str) - mbstowcs(NULL, str, 0) to count the number of bytes that do not encode a visible character.
setlocale() is standard C (C89, C99, C11), and is also defined in POSIX.1. mbstowcs() is standard C99 and C11, and is also defined in POSIX.1. Both are implemented in the Microsoft C libraries as well, so they work almost everywhere.
Consider the following sample program, which prints the strings given on the command line:
```c
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <stdio.h>

/* Counts the number of (visible) characters in a string */
static size_t ms_len(const char *const ms)
{
    if (ms)
        return mbstowcs(NULL, ms, 0);
    else
        return 0;
}

/* Number of bytes that do not generate a visible character in a string */
static size_t ms_extras(const char *const ms)
{
    if (ms)
        return strlen(ms) - mbstowcs(NULL, ms, 0);
    else
        return 0;
}

int main(int argc, char *argv[])
{
    int arg;

    /* Default locale */
    setlocale(LC_ALL, "");

    for (arg = 1; arg < argc; arg++)
        printf(">%-*s< (%zu bytes; %zu chars; %zu bytes extra in wide chars)\n",
               (int)(10 + ms_extras(argv[arg])), argv[arg],
               strlen(argv[arg]), ms_len(argv[arg]), ms_extras(argv[arg]));

    return EXIT_SUCCESS;
}
```
If you compile the above example and run
```
./example aaa aaä aää äää aa€ a€€ €€€ a ä € 😈
```
the program will output
```
>aaa       < (3 bytes; 3 chars; 0 bytes extra in wide chars)
>aaä       < (4 bytes; 3 chars; 1 bytes extra in wide chars)
>aää       < (5 bytes; 3 chars; 2 bytes extra in wide chars)
>äää       < (6 bytes; 3 chars; 3 bytes extra in wide chars)
>aa€       < (5 bytes; 3 chars; 2 bytes extra in wide chars)
>a€€       < (7 bytes; 3 chars; 4 bytes extra in wide chars)
>€€€       < (9 bytes; 3 chars; 6 bytes extra in wide chars)
>a         < (1 bytes; 1 chars; 0 bytes extra in wide chars)
>ä         < (2 bytes; 1 chars; 1 bytes extra in wide chars)
>€         < (3 bytes; 1 chars; 2 bytes extra in wide chars)
>😈         < (4 bytes; 1 chars; 3 bytes extra in wide chars)
```
If the last < does not line up with the others, that is because the font used is not exactly fixed-width: the emoticon 😈 is wider than ordinary characters such as ä, that's all. Blame the font.
The last character is U+1F608 SMILING FACE WITH HORNS, from the Unicode Emoticons block, in case your OS/browser/font cannot display it. On Linux, all of the above > and < line up correctly in every terminal I have, including the console (the non-graphical system console), even though the console font has no glyph for the emoticon and just displays it as a diamond.
Unlike Alter Mann's answer, this approach is portable and makes no assumptions about which character set the current user is actually using.