Writing Unicode Strings to a File

I am trying to develop a class for reading and writing files. It handles strings in two ways: ANSI and Unicode. ANSI works fine, but something is wrong with my Unicode handling.

It's a bit strange: I can read Unicode files just fine, I mean, without checking for or skipping the "0xFEFF" byte order mark. It works no matter what language the text is in (I have tried English, Chinese and Japanese). Is there anything I should know about here?

Then the biggest problem popped up: writing Unicode lines to a file. At first I tried the plain English alphabet without the '\n' character, and it worked perfectly. Then I added '\n' and everything started to go wrong: the output gets many spaces inserted, like "abcdefg\nh i jklmn\nopqrst\nuvwxyz" ('\n' works, but there are so many spaces), and the file is ANSI again. Never mind characters in other languages, I can't even read those back.

So, here is the question: what should I do to correctly write Unicode lines to a file, and how? Please do not suggest "_wopen"; the file is already opened with "fopen".

Any answers and recommendations will be much appreciated.

I am using Windows 7 and Visual Studio.

Edit: it works for non-English characters with the following code, but it is still wrong with '\n'.

    char* cStart = "\xff\xfe"; // UTF-16LE byte order mark (BOM)
    if (::ftell(m_pFile) == 0)
        ::fwrite(cStart, sizeof(wchar_t), 1, m_pFile);

But how does it work? I mean, I never saw this marker while I was reading the file.

Edit: part of my code.

    void File::ReadWText(wchar_t* pString, uint32 uLength)
    {
        wchar_t cLetter = L'\0';
        uint32 uIndex = 0;
        do {
            cLetter = L'\0';
            ::fread(&cLetter, sizeof(wchar_t), 1, m_pFile);
            pString[uIndex] = cLetter;
        } while (cLetter != L'\0' && !::feof(m_pFile) && uIndex++ < uLength);
        pString[uIndex] = L'\0';
    }

    void File::WriteWText(wchar_t* pString, uint32 uLength)
    {
        char* pStart = "\xff\xfe";
        if (::ftell(m_pFile) == 0)
            ::fwrite(pStart, sizeof(wchar_t), 1, m_pFile);
        m_uSize += sizeof(wchar_t) * ::fwrite(pString, sizeof(wchar_t), uLength, m_pFile);
    }

    void main()
    {
        ::File* pFile = new File();
        wchar_t* pWString = L"abcdefg\nhijklmn\nopqrst\nuvwxyz";
        pFile->Open("TextW.txt", File::Output); // fopen("TextW.txt", "w");
        pFile->WriteWText(pWString, ::wcslen(pWString));
        pFile->Close();
    }

Output file content: "abcdefg ਍ 栀 椀 樀 欀 氀 洀 渀 渀 ഀ ഀ ਍ 甀 甀 瘀 眀 砀 礀 稀" (the file is detected as Unicode).

I don't know if L"\n" is the correct expression; I have never worked with Unicode before. Thanks for helping me :)

+4

3 answers

I just noticed that this question is tagged both C and C++: the discussion below covers the C++ side. It ignores C's stdio entirely, and I don't know how the C library deals with different encodings.

When reading or writing a file, you need to tell the system what the file's encoding is, so that it can convert the bytes in the file to the program's internal characters when reading, and convert characters back to bytes when writing. In many cases this conversion is invisible, because the byte-to-character conversion is the identity: bytes can be interpreted as characters and vice versa. This is the case when the external encoding is ASCII (which I assume is what your question calls "ANSI").

Pretending that files are UTF-8 encoded and using the identity conversion from bytes to characters also works, to some extent. The original vision for the internal representation of characters in C++ was to have one unit per character, e.g. one char or one wchar_t. Although Unicode set a number of goals that would have worked well with this (for example, every character being represented by one unit, with a 16-bit unit size), all of those original goals eventually fell by the wayside, and we ended up with a system where one character (well, I think they are actually called "code points", but I'm not a Unicode expert) can consist of several units (for example, when combining characters are used). In any case, as long as individual units are not mutated without paying attention to the characters they belong to, you can generally treat UTF-8 as a sequence of char (e.g. as a std::string) and UTF-16 as a sequence of wchar_t (e.g. as a std::wstring). However, when reading anything other than UTF-8 (or ASCII, which is a subset of UTF-8), you should be careful to set the stream up so that it knows which encoding is used.

The standard way to tell a file stream about a particular encoding is to imbue it with a suitable std::locale containing a corresponding std::codecvt<...> facet, which converts between external bytes and internal characters using that specific encoding. How you actually obtain such a std::locale depends on the implementation. The default conversion effectively pretends that the program uses an extension of ASCII covering all char values. When reading and writing UTF-8, this happens to work.

I'm not sure what you mean by "write Unicode strings", but from the looks of it you are writing a std::wstring without setting up the encoding.
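A minimal sketch of that setup, assuming a C++11 compiler: std::codecvt_utf16 from <codecvt> provides such a facet for UTF-16 (the header is deprecated since C++17, but it ships with Visual Studio). The file name and test string are just the ones from the question:

    #include <codecvt>   // std::codecvt_utf16 (C++11)
    #include <fstream>
    #include <locale>

    int main()
    {
        // Binary mode: the CRT must not expand \n to \r\n, which would
        // corrupt the 16-bit code units produced by the facet.
        std::wofstream out("TextW.txt", std::ios::binary);

        // Imbue a locale whose codecvt facet writes the internal wchar_t
        // sequence as little-endian UTF-16 and emits a BOM up front.
        out.imbue(std::locale(out.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff,
                std::codecvt_mode(std::little_endian | std::generate_header)>));

        out << L"abcdefg\nhijklmn\nopqrst\nuvwxyz";
    }

Since the stream is in binary mode, each \n is written as a bare LF code unit; write L"\r\n" explicitly if the file must follow the Windows newline convention.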

+3

An answer to the edited question, now that the source is posted:

There is an error in void File::ReadWText(wchar_t* pString, uint32 uLength). If uLength is the size of the array (wchar_t string[size]), then

while (.... && uIndex++ < uLength); should be while (.... && (++uIndex) + 1 < uLength);

Otherwise pString[uIndex] = L'\0'; may overflow the buffer!
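Applied to the loop from the question, the corrected function would look like this (a sketch assuming uLength is the total array size, so the terminator always fits):

    void File::ReadWText(wchar_t* pString, uint32 uLength)
    {
        wchar_t cLetter = L'\0';
        uint32 uIndex = 0;
        do {
            cLetter = L'\0';
            ::fread(&cLetter, sizeof(wchar_t), 1, m_pFile);
            pString[uIndex] = cLetter;
        } while (cLetter != L'\0' && !::feof(m_pFile) && (++uIndex) + 1 < uLength);
        pString[uIndex] = L'\0'; // uIndex is at most uLength - 1 here
    }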

The newline problem: L"abcdefg\nhijklmn\nopqrst\nuvwxyz"; Windows uses \r\n as a newline, so L"abcdefg\r\nhijklmn\r\nopqrst\r\nuvwxyz"; should work.

Based on this problem, the MSDN thread "unicode newline", and your // fopen("TextW.txt", "w");, I believe you should open the file with "wb"! Otherwise every \n byte is automatically expanded to \r\n, and since only a single 0x0D byte is inserted between the two bytes of each 16-bit \n, the Unicode encoding of everything after it is thrown off.
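A minimal sketch of the combined fix in plain stdio, like the question's code (the File wrapper is left out, and the 2-byte wchar_t of Windows is assumed):

    #include <cstdio>
    #include <cwchar>

    int main()
    {
        // "wb": no \n -> \r\n translation by the CRT, so the 16-bit
        // code units reach the disk untouched.
        FILE* pFile = std::fopen("TextW.txt", "wb");
        if (!pFile) return 1;

        const wchar_t cBom = 0xFEFF; // written to disk as FF FE (UTF-16LE)
        std::fwrite(&cBom, sizeof(wchar_t), 1, pFile);

        // In binary mode, \r\n has to be spelled out by hand.
        const wchar_t* pText = L"abcdefg\r\nhijklmn\r\nopqrst\r\nuvwxyz";
        std::fwrite(pText, sizeof(wchar_t), std::wcslen(pText), pFile);

        std::fclose(pFile);
    }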

+2

This may help:

Remember to write the byte order mark, FF FE, at the beginning of the file.

Since you did not post any code, I believe you are writing the newline as an ASCII '\n' (as written in your question).

For a newline you need to write the bytes 0D 00 0A 00.

Or, if you want to use '\n', you should widen it to a 16-bit unit: (short)'\n'.
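A tiny sketch of both variants, assuming a FILE* f already opened in binary mode and the 16-bit wchar_t of Windows:

    // Writes the bytes 0D 00 0A 00: \r\n as UTF-16LE code units.
    const wchar_t cCrLf[] = { L'\r', L'\n' };
    ::fwrite(cCrLf, sizeof(wchar_t), 2, f);

    // Or the bare LF as one 16-bit unit, i.e. the bytes 0A 00:
    const wchar_t cLf = L'\n';
    ::fwrite(&cLf, sizeof(wchar_t), 1, f);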

+1

Source: https://habr.com/ru/post/1395662/

