How to use UTF-8 in C code?

My setup: gcc-4.9.2, UTF-8 environment.

The following C program works in ASCII, but not in UTF-8.

Create an input file:

echo -n 'привет мир' > /tmp/вход

This is test.c:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE 10

    int main(void)
    {
        char buf[SIZE+1];
        char *pat = "привет мир";
        char str[SIZE+2];
        FILE *f1;
        FILE *f2;

        f1 = fopen("/tmp/вход", "r");
        f2 = fopen("/tmp/выход", "w");

        if (fread(buf, 1, SIZE, f1) > 0) {
            buf[SIZE] = 0;
            if (strncmp(buf, pat, SIZE) == 0) {
                sprintf(str, "% 11s\n", buf);
                fwrite(str, 1, SIZE+2, f2);
            }
        }
        fclose(f1);
        fclose(f2);
        exit(0);
    }

Check the result:

    ./test; grep -q ' привет мир' /tmp/выход && echo OK

The goal is to make the UTF-8 version behave the way the ASCII version does, without splitting the bytes of a multibyte character apart. In other words: what should change in this example so that any UTF-8 character is processed as one unit (including argv, STDIN, STDOUT, STDERR, file input, file output, and the program source itself)?

+6
5 answers
 #define SIZE 10 

A buffer size of 10 is not enough to store the UTF-8 string: "привет мир" takes 19 bytes in UTF-8, plus the terminating NUL. Try changing it to a larger value. On my system (Ubuntu 12.04, gcc 4.8.1), changing it to 20 worked fine.

UTF-8 is a multibyte encoding that uses 1 to 4 bytes per character, so it is safer to use 40 as the buffer size above. There is a longer discussion in "How many bytes does one Unicode character take?" that may be of interest.
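To see the byte count for yourself, here is a minimal check (strlen counts bytes, not characters):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* strlen counts bytes, not characters:
           "привет мир" is 10 glyphs but 19 bytes in UTF-8 */
        printf("%zu\n", strlen("привет мир"));   /* prints 19 */
        return 0;
    }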

+7

Siddhartha Ghosh's answer identifies the main problem. However, fixing the code requires more work.

I used the following script (chk-utf8-test.sh):

    echo -n 'привет мир' > вход
    make utf8-test
    ./utf8-test
    grep -q 'привет мир' выход && echo OK

I called your program utf8-test.c and revised the source as follows, removing the references to /tmp and being more careful with the lengths:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE 40

    int main(void)
    {
        char buf[SIZE + 1];
        char *pat = "привет мир";
        char str[SIZE + 2];
        FILE *f1 = fopen("вход", "r");
        FILE *f2 = fopen("выход", "w");

        if (f1 == 0 || f2 == 0)
        {
            fprintf(stderr, "Failed to open one or both files\n");
            return(1);
        }

        size_t nbytes;
        if ((nbytes = fread(buf, 1, SIZE, f1)) > 0)
        {
            buf[nbytes] = 0;
            if (strncmp(buf, pat, nbytes) == 0)
            {
                sprintf(str, "%.*s\n", (int)nbytes, buf);
                fwrite(str, 1, nbytes, f2);
            }
        }
        fclose(f1);
        fclose(f2);
        return(0);
    }

And when I ran the script, I got:

    $ bash -x chk-utf8-test.sh
    + '[' -f /etc/bashrc ']'
    + . /etc/bashrc
    ++ '[' -z '' ']'
    ++ return
    + alias 'r=fc -e -'
    + echo -n 'привет мир'
    + make utf8-test
    gcc -O3 -g -std=c11 -Wall -Wextra -Werror utf8-test.c -o utf8-test
    + ./utf8-test
    + grep -q 'привет мир' $'\320\262\321\213\321\205\320\276\320\264'
    + echo OK
    OK
    $

For the record, I used GCC 5.1.0 on Mac OS X 10.10.3.

+6

This mostly restates the other answers, but I will try to explain it from a slightly different angle.

Here is a version of Jonathan Leffler's version of your code, with three minor changes: (1) I spelled out the actual individual bytes of the UTF-8 strings; (2) I modified the sprintf format width specifier to do what I hope you are actually trying to do; and (3), tangentially, I used perror to get a slightly more useful error message when something fails.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE 40

    int main(void)
    {
        char buf[SIZE + 1];
        char *pat = "\320\277\321\200\320\270\320\262\320\265\321\202"
            " \320\274\320\270\321\200";                                 /* "привет мир" */
        char str[SIZE + 2];
        FILE *f1 = fopen("\320\262\321\205\320\276\320\264", "r");          /* "вход" */
        FILE *f2 = fopen("\320\262\321\213\321\205\320\276\320\264", "w");  /* "выход" */

        if (f1 == 0 || f2 == 0)
        {
            perror("Failed to open one or both files");      /* use perror() */
            return(1);
        }

        size_t nbytes;
        if ((nbytes = fread(buf, 1, SIZE, f1)) > 0)
        {
            buf[nbytes] = 0;
            if (strncmp(buf, pat, nbytes) == 0)
            {
                sprintf(str, "%*s\n", 1+(int)nbytes, buf);   /* nbytes+1 width specifier */
                fwrite(str, 1, 1+nbytes, f2);                /* +1 here too */
            }
        }
        fclose(f1);
        fclose(f2);
        return(0);
    }

The behavior of sprintf with a positive numeric width specifier is to pad with blanks on the left, so the space flag you tried to use is redundant. But you have to make sure the target field is wider than the string you are printing for any padding to appear at all.
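A minimal sketch of that point: the width counts bytes, not glyphs, so a field that looks "wide enough" on screen may still produce no padding for UTF-8 text:

    #include <stdio.h>

    int main(void)
    {
        const char *s = "привет мир";  /* 10 glyphs, but 19 bytes in UTF-8 */
        printf("[%11s]\n", s);         /* width 11 < 19 bytes: no padding at all */
        printf("[%20s]\n", s);         /* width 20 = 19 bytes + 1: one leading space */
        return 0;
    }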

To make this answer self-contained, I will repeat what others have already said. A traditional char is always exactly one byte, but one character in UTF-8 is usually not exactly one byte, unless all of your characters are actually ASCII. One of the attractions of UTF-8 is that legacy C code does not need to know anything about UTF-8 in order to keep working, but of course the assumption that one char is one glyph no longer holds. (As you can see, for example, the glyph п in "привет мир" maps to the two bytes "\320\277", and therefore to two char values.)
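A minimal way to see those bytes for yourself:

    #include <stdio.h>

    int main(void)
    {
        const char *s = "привет мир";
        /* Dump each char (byte) in hex: every Cyrillic glyph occupies
           two char slots; prints d0 bf d1 80 ... (19 bytes, 10 glyphs) */
        for (const char *p = s; *p != '\0'; p++)
            printf("%02x ", (unsigned char)*p);
        putchar('\n');
        return 0;
    }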

This is clearly less than ideal, but it demonstrates that you can treat UTF-8 as "just bytes" if your code doesn't really care about glyph semantics. If it does care, you'd better switch to wchar_t, as described, for example, here: http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html
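If you do go the wchar_t route, here is a minimal sketch of the conversion step, assuming a UTF-8 locale is active (setlocale must come first, or mbstowcs will not decode multibyte input):

    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>

    int main(void)
    {
        setlocale(LC_ALL, "");              /* pick up the UTF-8 locale from the environment */
        const char *u8 = "привет мир";      /* 19 bytes of UTF-8 */
        wchar_t wbuf[32];
        size_t n = mbstowcs(wbuf, u8, 32);  /* number of wide chars, or (size_t)-1 on error */
        if (n == (size_t)-1) {
            fprintf(stderr, "invalid multibyte sequence for this locale\n");
            return 1;
        }
        printf("%zu wide characters\n", n); /* 10: one per glyph */
        return 0;
    }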

However, wchar_t is also less than ideal when the standard expectation is UTF-8. See the GNU libunistring documentation for a less intrusive alternative and a bit of background. With it, you replace char with uint8_t and the various str* functions with their u8_str* replacements, and you are done. The assumption that one glyph equals one byte would still lurk there, but that is a minor wrinkle in your sample program. An adaptation is available at http://ideone.com/p0VfXq (although, unfortunately, the library is not available on http://ideone.com/ so it could not be demonstrated there).
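For illustration, a minimal sketch assuming GNU libunistring is installed (compile with -lunistring); u8_strlen counts bytes while u8_mbsnlen counts characters:

    #include <stdio.h>
    #include <stdint.h>
    #include <unistr.h>   /* GNU libunistring: u8_strlen, u8_mbsnlen */

    int main(void)
    {
        const uint8_t *s = (const uint8_t *)"привет мир";
        size_t bytes = u8_strlen(s);           /* bytes before the NUL: 19 */
        size_t chars = u8_mbsnlen(s, bytes);   /* Unicode characters in those bytes: 10 */
        printf("%zu bytes, %zu characters\n", bytes, chars);
        return 0;
    }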

+3

The following code works as needed:

    #include <stdio.h>
    #include <locale.h>
    #include <stdlib.h>
    #include <wchar.h>

    #define SIZE 10

    int main(void)
    {
        setlocale(LC_ALL, "");
        wchar_t buf[SIZE+1];
        wchar_t *pat = L"привет мир";
        wchar_t str[SIZE+2];
        FILE *f1;
        FILE *f2;

        f1 = fopen("/tmp/вход", "r");
        f2 = fopen("/tmp/выход", "w");

        fgetws(buf, SIZE+1, f1);
        if (wcsncmp(buf, pat, SIZE) == 0) {
            swprintf(str, SIZE+2, L"% 11ls", buf);
            fputws(str, f2);
        }
        fclose(f1);
        fclose(f2);
        exit(0);
    }
+1

Your test.c file is not saved in UTF-8 format, and for this reason the "привет мир" string literal is not encoded as UTF-8 and the comparison fails. Change the text encoding of the source file and try again.
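One quick way to check how the source file is actually encoded (assuming the GNU/Linux file(1) utility):

    file -i test.c
    # e.g. test.c: text/x-c; charset=utf-8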

0

Source: https://habr.com/ru/post/987714/

