Why doesn't mbtowc count characters as expected?

I want to count the characters (in different encodings) in a file, and I use the 'mbtowc' function to detect them. I can't figure out why the character count differs so much from the result I expect. Here is my example:

    char buf[BUFFER_SIZE + MB_LEN_MAX];

    int fd = open ("chinese_test", O_RDONLY);

    unsigned int bytes, chars;
    int bytes_read;

    bytes = chars = 0;

    while((bytes_read = read(fd, buf, BUFFER_SIZE)) > 0) {
        wchar_t wc_buf[BUFFER_SIZE], *wcp;
        char *p;
        int n = 0;

        bytes += bytes_read;

        p = buf;
        wcp = wc_buf;

        while((n = mbtowc(wcp, p, MB_LEN_MAX)) > 0) {
            p += n;
            wcp++;
            chars++;
        }
    }

    printf("chars: %d\tbytes: %d\n", chars, bytes);

I am testing the code on a text containing some GB2312 characters, but the character count is far from what I expect.

My test result → chars: 4638 | bytes: 17473, but the Linux 'wc' command returns chars: 16770 | bytes: 17473

Why the difference? What have I done wrong?


I now have this code, but the results still differ.

    char buf[BUFFER_SIZE * MB_LEN_MAX];

    int fd = open ("test_chinese", O_RDONLY), filled = 0;

    unsigned int bytes, chars;
    int bytes_read;

    bytes = chars = 0;

    while((bytes_read = read(fd, buf, BUFFER_SIZE)) > 0) {
        wchar_t wc_buf[BUFFER_SIZE], *wcp;
        char *p;
        int n = 0;

        bytes += bytes_read;

        p = buf;
        wcp = wc_buf;

        while(bytes_read > 0) {
            n = mbtowc(NULL, p, MB_LEN_MAX);

            if (n <= 0) {
                p++;
                bytes_read--;
                continue;
            }
            p += n;
            bytes_read -= n;
            chars++;
        }
    }

    printf("\n\nchars: %d\tbytes: %d\n", chars, bytes);
1 answer

The problem is a combination of your BUFFER_SIZE, the size of the chinese_test file, and the byte alignment of wchar_t. As evidence, try drastically increasing BUFFER_SIZE: you should start getting the answer you want.

What happens is that your program works for the first block of text that it receives. But think about what happens in your code if a character is split between the first and second blocks as follows:

    |         First Block         |   Second Block    |
    | [wchar_t] [wchar_t] ... [wchar_t] [wchar_t] ... |
    | [1,2,3,4] [1,2,3,4] ... [1,2,3,4] [1,2,3,4] ... |

Your code will start the second block at the 3rd byte of a character, which will not be recognized as valid. Since mbtowc returns -1 when it does not find a valid character, your loop ends immediately and counts zero characters for that entire block. The same applies to the following blocks.
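One way to cope with a character that straddles two blocks, without re-reading any bytes, is the restartable variant `mbrtowc`, which remembers a partially read character in an `mbstate_t` between calls. The sketch below swaps in that technique and is not the asker's code: `count_chars_block` is a hypothetical helper name, and it assumes the locale has already been set to match the file's encoding and that a NUL character occupies a single byte.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: counts the characters that are completed inside
   buf[0..len). `st` carries a partially read character from one block to
   the next, so it must be zero-initialised before the first block and
   then passed unchanged to every later call.
   Returns the count, or (size_t)-1 on an invalid byte sequence. */
size_t count_chars_block(mbstate_t *st, const char *buf, size_t len)
{
    size_t chars = 0;
    size_t off = 0;

    while (off < len) {
        size_t n = mbrtowc(NULL, buf + off, len - off, st);

        if (n == (size_t)-1) {
            return (size_t)-1;  /* invalid sequence */
        }
        if (n == (size_t)-2) {
            break;              /* block ends mid-character; st remembers it */
        }
        if (n == 0) {
            n = 1;              /* NUL character; assumed to be one byte long */
        }
        off += n;
        chars++;
    }
    return chars;
}
```

Called once per `read` with the same `mbstate_t` each time, the per-block results can simply be added up: a character split across two blocks is counted once, in the block where its last byte arrives.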

EDIT:
Another problem that I noticed is that you need to set the locale for mbtowc to work properly. Taking all of these issues into account, I wrote the following, which returns the same character count for me as wc:

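    #include <stdlib.h>
    #include <stdio.h>
    #include <locale.h>

    int BUFFER_SIZE = 1024;
    const char *DEFAULT_F_IN = "chinese_test";

    struct counts {
        int bytes;
        int chars;
    };

    /* Count the complete characters in buf. Returns the number of trailing
       bytes that could not be decoded, e.g. a character cut off by the end
       of the block. */
    int count_block(struct counts *c, char *buf, int buf_size)
    {
        int offset = 0;
        while (offset < buf_size) {
            /* Never let mbtowc look past the bytes that were actually read. */
            size_t avail = (size_t)(buf_size - offset);
            size_t limit = avail < MB_CUR_MAX ? avail : MB_CUR_MAX;
            int n = mbtowc(NULL, buf + offset, limit);
            if (n <= 0) {
                mbtowc(NULL, NULL, 0);  /* reset mbtowc's internal state */
                break;
            }
            offset += n;
            c->bytes += n;
            c->chars++;
        }
        return buf_size - offset;
    }

    void get_counts(struct counts *c, FILE *fd)
    {
        char buf[BUFFER_SIZE];
        c->bytes = 0;
        c->chars = 0;
        int bytes_read;
        while((bytes_read = fread(buf, sizeof(*buf), BUFFER_SIZE, fd)) > 0) {
            int remaining = count_block(c, buf, bytes_read);
            if (remaining == 0) {
                continue;
            } else if (remaining < bytes_read && remaining < (int)MB_CUR_MAX) {
                /* A character was split across blocks: rewind so that the
                   next read starts at its first byte. */
                fseek(fd, -remaining, SEEK_CUR);
            } else {
                /* Invalid sequence, or no progress was made on this block. */
                perror("Error");
                exit(1);
            }
        }
    }

    int main(int argc, char *argv[])
    {
        FILE *fd;
        if (argc > 1) {
            fd = fopen(argv[1], "rb");
        } else {
            fd = fopen(DEFAULT_F_IN, "rb");
        }
        if (fd == NULL) {
            perror("fopen");
            return 1;
        }

        /* mbtowc decodes according to the current locale. */
        setlocale(LC_ALL, "");

        struct counts c;
        get_counts(&c, fd);
        fclose(fd);
        printf("chars: %d\tbytes: %d\n", c.chars, c.bytes);

        return 0;
    }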
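On the locale point: a program starts in the "C" locale, and `mbtowc` interprets bytes according to the `LC_CTYPE` category, so it helps to check that the locale switch actually succeeded. The following is a minimal sketch, assuming a hypothetical helper named `select_ctype_locale`; whether a given locale name is accepted depends on what is installed on the system.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: switch LC_CTYPE to `name` ("" means "take it from
   the environment") and report MB_CUR_MAX, the maximum number of bytes in
   a multibyte character for that locale. Returns 0 if the locale is not
   available, in which case the previous locale stays in effect. */
size_t select_ctype_locale(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL) {
        fprintf(stderr, "locale \"%s\" is not available\n", name);
        return 0;
    }
    return MB_CUR_MAX;
}
```

If this returns 1 for a file that is expected to contain multibyte text, the active locale is a single-byte one, and `mbtowc` will not decode the file the way `wc` does under a multibyte locale.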

Source: https://habr.com/ru/post/908113/

