Error comparing UTF-8 characters with wchar.h

I am writing a small program that reads a file containing UTF-8 text, char by char. After reading a char, it compares it against several other characters and, if there is a match, replaces the character in the file with the underscore character '_'.

(Well, it actually writes a copy of the file with the special letters replaced by underscores.)

I'm not sure exactly where I'm going wrong here, but it's most likely everywhere.

Here is my code:

    FILE *fpi;
    FILE *fpo;
    char ifilename[FILENAME_MAX];
    char ofilename[FILENAME_MAX];
    wint_t sample;

    fpi = fopen(ifilename, "rb");
    fpo = fopen(ofilename, "wb");

    while (!feof(fpi)) {
        fread(&sample, sizeof(wchar_t*), 1, fpi);
        if ((wcscmp(L"ά", &sample) == 0) || (wcscmp(L"ε", &sample) == 0)) {
            fwrite(L"_", sizeof(wchar_t*), 1, fpo);
        } else {
            fwrite(&sample, sizeof(wchar_t*), 1, fpo);
        }
    }

I've skipped the code that builds the file names because it has nothing to do with the problem; it's just string manipulation.

If I feed this program a file containing the words γειά σου κόσμε. , I would like it to produce this: γει_ σου κόσμ_.

Searching the Internet did not help much, since most of the results were either very general or about completely different aspects of UTF-8, as if nobody ever needs to manipulate individual characters.

Anything that points me in the right direction is welcome. I'm not necessarily looking for a directly fixed version of the code I posted; I would be grateful for any insightful comments that help me understand how the wchar mechanism works. All the wbyte, wchar, L, no-L stuff is a mess to me.

Thank you in advance for your help.

+4
2 answers

First of all, please take the time to read this wonderful article that explains UTF8 vs Unicode and many other important things about strings and encodings: http://www.joelonsoftware.com/articles/Unicode.html

What your code is trying to do is read Unicode character by character and compare the characters as it goes. That will not work if the input stream is UTF-8, and it really cannot be done with this structure.

In short: full Unicode strings can be encoded in several ways. One of them uses a series of "wide" characters of the same size, one per character; the type wchar_t (sometimes WCHAR) is used for this. Another way is UTF-8, which uses a variable number of raw bytes to encode each character, depending on the character's value.

UTF-8 is just a stream of bytes that can encode a Unicode string and is commonly used in files. It is not the same as a WCHAR string, which is the more common in-memory representation. You cannot reliably seek around in a UTF-8 stream and replace characters in place. You need to read the whole thing in and decode it, then loop over the WCHARs to do your comparisons and replacements, and then re-encode the result to UTF-8 to write to the output file.

On Win32, use MultiByteToWideChar to do the decoding, and the corresponding WideCharToMultiByte to convert back.
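
For example, a minimal sketch of that decode, process, re-encode round trip using those two Win32 calls might look like the following (the helper name, the buffers, and the assumption that the whole UTF-8 input is already in memory are mine, not from the original answer; error checking is omitted, and the source file is assumed to be saved in an encoding the compiler understands so the L'ά' literal works):

    #include <stdlib.h>
    #include <windows.h>

    /* Sketch only: utf8_in holds the whole UTF-8 input, utf8_out receives the
       result. Real code must check the return values of both conversions. */
    static void replace_special(const char *utf8_in, char *utf8_out, int out_cap)
    {
        /* 1. Decode UTF-8 -> wide characters (first call measures the size). */
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8_in, -1, NULL, 0);
        wchar_t *wbuf = malloc(wlen * sizeof(wchar_t));
        MultiByteToWideChar(CP_UTF8, 0, utf8_in, -1, wbuf, wlen);

        /* 2. Do the per-character comparisons and replacements on wide chars. */
        for (int i = 0; wbuf[i] != L'\0'; i++) {
            if (wbuf[i] == L'ά' || wbuf[i] == L'ε')
                wbuf[i] = L'_';
        }

        /* 3. Re-encode wide characters -> UTF-8 for the output file. */
        WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, utf8_out, out_cap, NULL, NULL);
        free(wbuf);
    }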

When you use "string literal" with regular quotes, you create an ASCII string with nul-terminated ( char* ), which does not support Unicode. L"string literal" with the prefix L will create a line with zero termination WCHAR (wchar_t *), which you can use in comparing strings or characters. The L prefix also works with single-quoted letter characters, for example: L'ε'


As a commenter noted, when you use fread / fwrite you should pass sizeof(wchar_t), not the size of its pointer type, since the amount you are trying to read / write is an actual wchar, not the size of a pointer to one. This tip is just feedback on the code as written, independent of the points above - you don't want to be reading the input one character at a time anyway.

Note that when you do string comparisons ( wcscmp ), you must use actual wide strings (which are terminated by a nul wide char) - do not feed it individual characters sitting in memory. If (when) you want to do character-to-character comparisons, you don't even need string functions: since a WCHAR is just a value, you can compare directly: if (sample == L'ά') {} .
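
As a small self-contained illustration of that last point (again assuming the source file is saved so the L'ά' literal is meaningful):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        wchar_t sample = L'ά';

        /* Direct value comparison: a wide character is just a number. */
        if (sample == L'ά')
            puts("matched with ==");

        /* If you do use wcscmp, both arguments must be nul-terminated wide strings. */
        wchar_t one_char_string[2] = { sample, L'\0' };
        if (wcscmp(one_char_string, L"ά") == 0)
            puts("matched with wcscmp");

        return 0;
    }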

+3

C has two different types of characters: multibyte characters and wide characters.

Multibyte characters can take a varying number of bytes. For example, in UTF-8 (which is a variable-length encoding of Unicode) a takes 1 byte and α takes 2 bytes.

Wide characters always take the same number of bytes. Additionally, wchar_t must be able to hold any character of the execution character set. So, when using UTF-32, both a and α take 4 bytes each. Unfortunately, on some platforms wchar_t is 16 bits wide: such platforms cannot correctly support characters outside the BMP via wchar_t . If __STDC_ISO_10646__ is defined, then wchar_t holds Unicode code points, so it should be (at least) 4 bytes (technically, at least 21 bits wide).
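
If you want to see what your own platform does, a quick probe along these lines (output varies by compiler and OS) prints the width of wchar_t and whether it claims to hold Unicode code points:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    #ifdef __STDC_ISO_10646__
        /* Defined as a long constant of the form yyyymmL on conforming platforms. */
        printf("__STDC_ISO_10646__ = %ld (wchar_t holds Unicode code points)\n",
               (long)__STDC_ISO_10646__);
    #else
        puts("__STDC_ISO_10646__ is not defined here");
    #endif
        return 0;
    }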

So, when using UTF-8 you should work with multibyte characters, which are stored in regular char variables (but beware of strlen() , which counts bytes, not multibyte characters).
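
One standard way to step through those multibyte characters one at a time is mbrtowc from <wchar.h>. A small sketch, assuming a UTF-8 locale is installed and the source file itself is saved as UTF-8:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        /* Assumes a UTF-8 locale; the locale name may differ on your system. */
        setlocale(LC_ALL, "en_US.UTF-8");

        const char *text = "γειά";   /* 4 multibyte characters, 8 bytes in UTF-8 */
        printf("strlen: %zu bytes\n", strlen(text));

        mbstate_t state = {0};
        const char *p = text;
        size_t left = strlen(text);
        wchar_t wc;
        size_t n;

        /* mbrtowc decodes one multibyte character per call and reports how many
           bytes it consumed, so we can walk the string character by character. */
        while (left > 0 && (n = mbrtowc(&wc, p, left, &state)) != (size_t)-1
                        && n != (size_t)-2 && n != 0) {
            /* Where __STDC_ISO_10646__ is defined, wc is the Unicode code point. */
            printf("U+%04lX uses %zu byte(s)\n", (unsigned long)wc, n);
            p += n;
            left -= n;
        }
        return 0;
    }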

Unfortunately, there's more to Unicode.

ά can be represented as a single Unicode codepoint, or as two separate codepoints:

  • U+03AC GREEK SMALL LETTER ALPHA WITH TONOS ← 1 codepoint ← 1 multibyte character ← 2 bytes (0xCE 0xAC) = 2 chars.
  • U+03B1 GREEK SMALL LETTER ALPHA + U+0301 COMBINING ACUTE ACCENT ← 2 codepoints ← 2 multibyte characters ← 4 bytes (0xCE 0xB1 0xCC 0x81) = 4 chars.
  • U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA ← 1 codepoint ← 1 multibyte character ← 3 bytes (0xE1 0xBD 0xB1) = 3 chars.

All of the above are canonical equivalents, which means they should be treated as equal for all purposes. Therefore, you should normalize your input/output strings using one of the Unicode normalization algorithms (there are 4: NFC, NFD, NFKC, NFKD).
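
To see why that matters, here is a tiny demonstration (my own, not part of the original answer) that two canonically equivalent spellings of ά are different byte sequences, so an un-normalized byte comparison treats them as different:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* The same ά, spelled two canonically equivalent ways
           (UTF-8 bytes taken from the list above). */
        const char precomposed[] = "\xCE\xAC";          /* U+03AC */
        const char decomposed[]  = "\xCE\xB1\xCC\x81";  /* U+03B1 U+0301 */

        /* Without normalization, a byte-level comparison says "different". */
        printf("byte-equal? %s\n",
               strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
        printf("lengths: %zu vs %zu bytes\n",
               strlen(precomposed), strlen(decomposed));
        return 0;
    }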

+6
