Can't get some UTF-8 characters to print correctly from C++ into a txt file

I have some UTF-8 strings in memory (this is part of a larger system) that are mostly place names in European countries. What I'm trying to do is write them to a text file. I am on a Linux machine (Fedora). When I write these name strings (char pointers) to a file, the file ends up saved in what looks like an extended-ASCII format.

Then I copy this file to my Windows machine, where I need to load these names into a MySQL database. When I open the text file in Notepad++, it defaults to ANSI encoding. But I can switch the encoding to UTF-8, and almost all the characters look as expected, with the exception of the following three: Ő, ő and ű. They appear in the text as &#336;, &#337; and &#369;.

Does anyone have an idea what might be wrong? I know these are not part of the extended ASCII character set. The way I write them to the file is something like this:

// create output file stream
std::ofstream fs("sample.txt");
// loop through the UTF-8 formatted string list
if (fs.is_open()) {
    for (int i = 0; i < num_strs; i++) {
        fs << str_name; // unsigned char pointer representing a name in UTF-8 format
        fs << "\n";
    }
}
fs.close();

Everything looks fine, even with characters like ú, ö and ß. The problem is only with those three characters. Any thoughts / suggestions / comments on this? Thanks!

As an example, a line like "Gyömrő" is displayed as "Gyömr&#369;".

+4

3 answers

You need to determine at what stage the unexpected &#336; HTML entities are introduced. My best guess is that they are already in the strings you are writing to the file. Use a debugger, or add test code that counts the '&' characters in a string.
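A minimal sketch of such a check (the string literal here is hypothetical, standing in for one of the names in memory):

// Minimal sketch: count '&' characters in a string before it is written out.
// The example string is hypothetical, standing in for one of the names in memory.
#include <cstdio>

int count_ampersands(const char* s) {
    int count = 0;
    for (; *s != '\0'; ++s) {
        if (*s == '&') {
            ++count;
        }
    }
    return count;
}

int main() {
    const char* name = "Gyömr&#369;"; // hypothetical suspect string
    std::printf("'&' occurrences: %d\n", count_ampersands(name));
}

If the count is non-zero before the file is ever written, the entities are coming from upstream, not from your output code.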

That would mean your data source does not strictly use UTF-8 for non-ASCII characters, but sometimes uses HTML entities instead. This is strange, but possible if your data source is an HTML file (or something like that).

In addition, you can view your output file in hex mode (there is a good plugin for Notepad++ for this). This may help you understand what UTF-8 actually means at the byte level: the 128 ASCII characters use a single byte with a value of 0-127. Other characters use two to four bytes, each with a value > 127. HTML entities are not really an encoding at all, but rather an escape sequence, like "\n" and "\r".
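As an illustration of what to look for in the hex view, here is a small sketch that prints every byte of a string in hex; in valid UTF-8, for example, ő (U+0151) should appear as the two bytes C5 91 (the string literal is just an example):

// Sketch: dump every byte of a string in hex to verify the encoding.
// In valid UTF-8, "ő" (U+0151) shows up as the two bytes C5 91.
#include <cstdio>

void hex_dump(const char* s) {
    for (; *s != '\0'; ++s) {
        std::printf("%02X ", static_cast<unsigned char>(*s));
    }
    std::printf("\n");
}

int main() {
    hex_dump("Gyömrő"); // assumes this source file itself is saved as UTF-8
}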

+3

If, when you open the file in Notepad++ and select UTF-8, your characters do not display correctly, then they are not encoded as UTF-8. You also mention "extended ASCII", which has very little to do with Unicode encodings. My suspicion is that you are actually writing your characters in some code page, for example ISO-8859-1.

Try looking at the number of bytes of these problematic strings in your program; if the number of bytes matches the number of characters, then you are not actually encoding them as UTF-8.

Any character that lies outside the 128-character ASCII character table will be encoded with at least two bytes in UTF-8.
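A small sketch of that byte-versus-character comparison (it relies on the fact that UTF-8 continuation bytes always have the form 10xxxxxx; the example string is only an illustration):

// Sketch: compare the byte length with the code-point count.
// UTF-8 continuation bytes have the form 10xxxxxx (0x80-0xBF), so
// counting only the bytes that are NOT continuation bytes gives the
// number of code points.
#include <cstdio>
#include <cstring>

std::size_t count_code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

int main() {
    const char* name = "Gyömrő"; // assumes the source file is saved as UTF-8
    std::printf("bytes: %zu, code points: %zu\n",
                std::strlen(name), count_code_points(name));
}

For a properly UTF-8 encoded "Gyömrő" this prints 8 bytes for 6 code points; if the two numbers are equal, you are looking at a single-byte code page instead.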

To handle Unicode correctly in your C++ application, take a look at ICU: http://site.icu-project.org/
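As a rough, hedged sketch of what that looks like (assuming ICU is installed; link with something like -licuuc):

#include <unicode/unistr.h>
#include <cstdio>
#include <string>

int main() {
    // Build a UnicodeString from UTF-8 input and count its code points.
    icu::UnicodeString name = icu::UnicodeString::fromUTF8("Gyömrő");
    std::printf("code points: %d\n", static_cast<int>(name.countChar32()));
    std::string back;
    name.toUTF8String(back); // round-trip back to UTF-8 bytes
    std::printf("round-trip: %s\n", back.c_str());
}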

+1

By default, std::codecvt<char, char, mbstate_t> does nothing for you: it means no conversion at all. You will need to imbue() a std::locale containing a UTF-8 code-conversion facet. Note, however, that char cannot really represent Unicode values; you need a larger type, even though the particular values you are looking for happen to fit into a char in Unicode, but not in any encoding that accepts all values.

The C++ 2011 standard defines the UTF-8 conversion facet std::codecvt_utf8<...>. However, it is not specialized for the internal char type, only for wchar_t, uint16_t and uint32_t. Using clang together with libc++, the following did the right thing for me:

#include <fstream>
#include <locale>
#include <codecvt>

int main() {
    std::wofstream out("utf8.txt");
    std::locale utf8(std::locale(), new std::codecvt_utf8<wchar_t>());
    out.imbue(utf8);
    out << L"\xd6\xf6\xfc\n";
    out << L"Ööü\n";
}

Note that this code uses wchar_t, not char. It might seem reasonable to use char16_t or char32_t, because they are intended to encode UCS2 and UCS4 respectively (if I understand the standard correctly), but no stream types are defined for them. Setting up stream types for a new character type is somewhat of a pain.
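One way to sidestep that pain, as a hedged sketch using std::wstring_convert from the same C++11 facilities (deprecated in later standards, but contemporary with this answer), is to convert to UTF-8 in memory and then write the bytes through an ordinary std::ofstream:

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main() {
    // Convert a char32_t string to UTF-8 bytes in memory, then write the
    // bytes through an ordinary narrow ofstream; no wide stream is needed.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::string utf8 = conv.to_bytes(U"Ööü\n");
    std::ofstream out("utf8.txt", std::ios::binary);
    out << utf8;
}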

-1
