How to open a file with a wchar_t * containing a non-ASCII string on Linux?

Question

Environment: GCC / g++, Linux

I have a non-ascii file on the file system and want to open it.

I have its name as a wchar_t *, but I don't know how to open the file with it. (My trusty fopen only accepts a char * file name.)

Please, help. Thank you very much.

+4

c++ c linux file wchar

Cauly Jan 13 '11 at 2:49

6 answers

Linux is not UTF-8, but it is your only choice for file names anyway

(Files can have anything inside them.)

Regarding file names, Linux doesn't really have a string encoding to worry about. File names are byte strings that must be null-terminated.

This does not mean that Linux is UTF-8, but it does mean that it is incompatible with wide characters, since a wide character can contain a zero byte that is not the end of the string.

But UTF-8 preserves the no-nulls-except-at-end model, so I have to believe that the practical approach is to "convert to UTF-8" for file names.
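To make the embedded-zero point concrete, here is a small C sketch. The function names embedded_zero_bytes and length_seen_as_bytes are made up for illustration; the code only inspects the in-memory bytes of a wide string and makes no claim about any particular file system.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the zero bytes in the in-memory representation of a wide string,
 * not counting the terminating L'\0'. Any such byte would end the name
 * early if the buffer were handed to the kernel as a char string. */
size_t embedded_zero_bytes(const wchar_t *w)
{
    const unsigned char *bytes = (const unsigned char *)w;
    size_t total = wcslen(w) * sizeof(wchar_t);   /* terminator excluded */
    size_t zeros = 0;

    for (size_t i = 0; i < total; i++) {
        if (bytes[i] == 0)
            zeros++;
    }
    return zeros;
}

/* Length a byte-oriented API would see if given the wide buffer as char *. */
size_t length_seen_as_bytes(const wchar_t *w)
{
    return strlen((const char *)w);
}
```

For an ASCII letter stored in a wchar_t, every byte except one is zero, so a byte-oriented API sees at most one character of the name. The UTF-8 encoding of the same text contains no zero byte before its terminator.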

The contents of files are governed by standards above the level of the Linux kernel, so there is nothing Linux-y that you can or want to do about them. File contents are solely the concern of the programs that read and write them. Linux just stores and returns a stream of bytes, and that stream can have all the embedded zeros you want.

+3

Digitaloss Jan 13 '11 at 3:40

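Convert the wchar_t string to a UTF-8 char string, then use fopen.

    typedef unsigned int uint;
    typedef unsigned short word;
    typedef unsigned char byte;

    // Caveat: both functions treat wchar_t as a 16-bit unit and only handle
    // code points up to U+FFFF. With glibc on Linux wchar_t is 32 bits wide,
    // so the (word*) casts do not read or write the string correctly there.

    // wide string -> UTF-8; returns the number of bytes written
    int UTF16to8( wchar_t* w, char* s ) {
      uint c;
      word* p = (word*)w;
      byte* q = (byte*)s; byte* q0 = q;
      while( 1 ) {
        c = *p++;
        if( c==0 ) break;
        if( c<0x080 ) *q++ = c; else
        if( c<0x800 ) *q++ = 0xC0+(c>>6), *q++ = 0x80+(c&63); else
        *q++ = 0xE0+(c>>12), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63);
      }
      *q = 0;
      return q-q0;
    }

    // UTF-8 -> wide string; returns the number of 16-bit units written
    int UTF8to16( char* s, wchar_t* w ) {
      uint cache=0, wait=0, c;
      byte* p = (byte*)s;
      word* q = (word*)w; word* q0 = q;
      while( 1 ) {
        c = *p++;
        if( c==0 ) break;
        if( c<0x80 ) cache=c, wait=0; else               // single byte
        if( (c>=0xC0) && (c<0xE0) ) cache=c&31, wait=1; else  // lead byte of 2
        if( c>=0xE0 ) cache=c&15, wait=2; else           // lead byte of 3
        if( wait ) cache=(cache<<6)+(c&63), wait--;      // continuation byte
        if( wait==0 ) *q++ = cache;
      }
      *q = 0;
      return q-q0;
    }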
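The functions above read the wide string through 16-bit units. With glibc on Linux each wchar_t is 32 bits and holds one code point, so a converter for that case looks a little different. A minimal sketch under that assumption follows; the names wide_to_utf8 and fopen_wide are hypothetical, not part of any library.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Encode a wide string as UTF-8. ASSUMPTION: each wchar_t holds one
 * Unicode code point (true for glibc, where wchar_t is 32 bits).
 * Returns a malloc'd, null-terminated string the caller must free,
 * or NULL on an invalid code point or allocation failure. */
char *wide_to_utf8(const wchar_t *w)
{
    size_t n = wcslen(w);
    unsigned char *out = malloc(4 * n + 1);   /* at most 4 bytes per code point */
    unsigned char *q = out;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        unsigned long c = (unsigned long)w[i];

        if (c < 0x80) {
            *q++ = (unsigned char)c;
        } else if (c < 0x800) {
            *q++ = (unsigned char)(0xC0 | (c >> 6));
            *q++ = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            if (c >= 0xD800 && c <= 0xDFFF) {   /* surrogates are not characters */
                free(out);
                return NULL;
            }
            *q++ = (unsigned char)(0xE0 | (c >> 12));
            *q++ = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            *q++ = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c <= 0x10FFFF) {
            *q++ = (unsigned char)(0xF0 | (c >> 18));
            *q++ = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            *q++ = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            *q++ = (unsigned char)(0x80 | (c & 0x3F));
        } else {                                /* beyond the Unicode range */
            free(out);
            return NULL;
        }
    }
    *q = '\0';
    return (char *)out;
}

/* Hypothetical helper: open a file whose name is given as a wide string. */
FILE *fopen_wide(const wchar_t *name, const char *mode)
{
    char *utf8 = wide_to_utf8(name);
    FILE *f;

    if (utf8 == NULL)
        return NULL;
    f = fopen(utf8, mode);
    free(utf8);
    return f;
}
```

This only opens the right file if the name is stored on disk as UTF-8, which is the assumption the answer relies on.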

+1

Shelwien Jan 13 '11 at 3:14

Check out this document.

http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm

I think Linux follows the POSIX standard, which treats all file names as UTF-8.

0

Peon the great Jan 13 '11 at 3:03

I take it that it is the name of the file that contains non-ASCII characters, and not the file itself, when you say "non-ascii file on the file system". It doesn't really matter what the file contains.

You can do this with regular fopen, but you have to match the encoding used by the file system.

It depends on which version of Linux and which file system you are using, and how you configured it, but if you're lucky, the file system uses UTF-8. So take your wchar_t string (which is probably encoded in UTF-16?), convert it to a char string encoded in UTF-8, and pass it to fopen.
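One way to do that conversion without writing the encoder by hand is iconv. The sketch below assumes the iconv implementation accepts "WCHAR_T" as an encoding name, which glibc does; the function name wide_to_utf8_iconv is made up for this example.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a wide string to UTF-8 with iconv. ASSUMPTION: the iconv
 * implementation knows the encoding name "WCHAR_T" (glibc does).
 * Returns a malloc'd string the caller must free, or NULL on failure. */
char *wide_to_utf8_iconv(const wchar_t *w)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return NULL;

    size_t units = wcslen(w);
    size_t in_left = units * sizeof(wchar_t);
    size_t out_left = units * 4;                   /* worst case: 4 bytes per unit */
    char *out = malloc(out_left + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv's prototype takes char **; the input buffer is not modified. */
    char *in_ptr = (char *)w;
    char *out_ptr = out;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}
```

The returned string can be passed to fopen and then released with free. On systems where iconv lives in a separate library, linking with -liconv may be needed.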

0

metamatt Jan 13 '11 at 3:03

    // locals
    string file_to_read;   // any file
    wstring file;          // receives the contents of an ASCII or non-ASCII file
    FILE *stream;
    wchar_t buffer = 0;

    if( fopen_s( &stream, file_to_read.c_str(), "r+b" ) == 0 )  // in binary mode
    {
        // if ascii file, second arg must be sizeof( char ); if non-ascii file, sizeof( wchar_t )
        // test fread's return value instead of feof(), so the last character is not read twice
        while( fread( &buffer, sizeof( char ), 1, stream ) == 1 )
        {
            file.append( 1, buffer );
        }
        fclose( stream );
    }
    // the file is now in wstring format; you can use it in any C++ wstring operation
    // this code is fast enough I think, at least in my practice
    // for Windows only, because of fopen_s

0

Tanzer Aug 25 '14 at 20:37

R .. · Accepted Answer · 2011-01-13T04:11:58+0000

There are two possible answers:

If you want all Unicode file names to be representable, you can hard-code the assumption that the file system uses UTF-8 file names. This is the "modern" approach for Linux desktops. Just convert the strings from wchar_t (UTF-32) to UTF-8 with library functions (iconv will work well) or your own implementation (but look up the specs so you don't get it wrong like Shelwien did), then use fopen.

If you want to do things in a more standards-oriented way, you should use wcsrtombs to convert the wchar_t string to a multibyte char string in the locale's encoding (which is hopefully UTF-8 anyway on any modern system) and use fopen. Note that this requires that you previously set the locale using setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "").

And finally, not quite the answer, but the recommendation:

Storing file names as wchar_t strings is probably a horrible mistake. Instead, you should store file names as abstract byte strings and only convert them to wchar_t just in time to display them in the user interface (if it is even necessary then; many UI toolkits take plain byte strings and do the interpretation as characters for you). This way you eliminate a lot of possible nasty corner cases, and you never run into a situation where some files are inaccessible because of their names.
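A sketch of that recommendation: the name stays a plain char string for fopen, and a wide copy is made only for display. The function name display_name and the choice of L'?' as a placeholder are illustrative assumptions; the conversion uses whatever locale the program has set.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Build a wide string for display from a file name held as raw bytes.
 * Bytes that are not valid in the locale's encoding become L'?', so the
 * result is for display only; keep passing the original bytes to fopen.
 * Returns a malloc'd string the caller must free, or NULL if out of memory. */
wchar_t *display_name(const char *name)
{
    size_t n = strlen(name);
    /* Each loop step consumes at least one byte and emits one wchar_t,
     * so n + 1 elements are always enough. */
    wchar_t *out = malloc((n + 1) * sizeof *out);
    mbstate_t state;
    size_t i = 0, j = 0;

    if (out == NULL)
        return NULL;
    memset(&state, 0, sizeof state);

    while (i < n) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, name + i, n - i, &state);

        if (r == (size_t)-1 || r == (size_t)-2) {
            /* Invalid or truncated sequence: placeholder, skip one byte. */
            out[j++] = L'?';
            i++;
            memset(&state, 0, sizeof state);
        } else {
            out[j++] = wc;
            i += (r == 0) ? 1 : r;   /* always advance */
        }
    }
    out[j] = L'\0';
    return out;
}
```

Because the original bytes are never replaced by the converted form, a file whose name is not valid in the current locale can still be opened; only its on-screen rendering is approximate.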
