Ignore byte-order marks in C++ when reading from a stream

Question

I have a function that reads the value of a single variable (integer, double, or boolean) from a single line of an ifstream:

    template <typename Type>
    void readFromFile (ifstream &in, Type &val)
    {
        string str;
        getline (in, str);
        stringstream ss(str);
        ss >> val;
    }

However, it fails on text files created with editors that insert a byte-order mark, or BOM (

+6

c++ unicode

F'x Jan 16 '12 at 13:17

2 answers

You need to start by reading the first byte or two of the stream and deciding whether it is part of a BOM or not. This is a bit of a pain, since you can only putback a single byte, whereas you will typically want to read four. The easiest solution is to open the file, read the initial bytes, remember how many you need to skip, then seek back to the beginning and skip them.

+4

James Kanze Jan 16 '12 at 13:32

bames53 · Accepted Answer · 2012-01-16T15:20:41+0000

(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should be avoided elsewhere.)

You could open the file as a UTF-8 file and then check whether the first character is U+FEFF. You can do this by opening a normal char-based fstream and then using wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t, so the following uses UTF-16 in wchar_t.

    std::fstream fs(filename);
    std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> wb(fs.rdbuf());
    std::wistream is(&wb);
    // If you don't do this on the stack, remember to destroy the objects
    // in reverse order of creation: is, then wb, then fs.
    std::wistream::int_type ch = is.get();
    const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
    if (ZERO_WIDTH_NO_BREAK_SPACE != ch)
        is.putback(ch);
    // Now the stream can be passed around and used without worrying about
    // the extra character in the stream.
    int i;
    readFromStream<int>(is, i);

Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because U+FEFF should only be ignored if it is the very first character in the whole file, if at all. It shouldn't be done anywhere else.

On the other hand, if you're happy using a char-based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:

    std::fstream fs(filename);
    char a, b, c;
    a = fs.get();
    b = fs.get();
    c = fs.get();
    if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
        fs.clear();   // a file shorter than three bytes leaves failbit set
        fs.seekg(0);  // no signature: rewind to the start
    } else {
        std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
    }
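To tie this back to the question, here is a hedged sketch of how the signature check might be done once on the whole stream, with values then read line by line. `skipUtf8Signature` is a hypothetical name, and `readFromStream` is the question's `readFromFile` generalised to any `std::istream`, which is an assumption rather than something the question shows.

```cpp
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: skips the UTF-8 signature EF BB BF if the stream
// starts with it. Returns true when a signature was found and skipped.
// Assumes `in` is seekable and currently at the beginning.
bool skipUtf8Signature(std::istream& in)
{
    char head[3] = {0, 0, 0};
    in.read(head, 3);
    const bool found = in.gcount() == 3
        && head[0] == '\xEF' && head[1] == '\xBB' && head[2] == '\xBF';
    if (!found) {
        in.clear();                 // short input leaves eofbit/failbit set
        in.seekg(0, std::ios::beg); // no signature: rewind to the start
    }
    return found;
}

// The question's helper, generalised (an assumption) to take std::istream,
// so it works with file streams and string streams alike.
template <typename Type>
void readFromStream(std::istream& in, Type& val)
{
    std::string str;
    std::getline(in, str);
    std::stringstream ss(str);
    ss >> val;
}
```

The intended use is to call `skipUtf8Signature` exactly once, right after opening the file, and only then start calling `readFromStream` for each line.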

Also, if you want to use wchar_t internally, the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume BOMs for you. The only problem is that wchar_t is widely recognized as worthless these days*, so you probably shouldn't do this.

    std::wifstream fin(filename);
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));

* wchar_t is worthless because it is specified to do just one thing: provide a fixed-size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be a different character in different locales, so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions).

The fixed-size representation is itself worthless for two reasons. First, many code points have semantic meanings, so understanding text means you have to process multiple code points anyway. Second, some platforms, such as Windows, use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way even conforms to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; if no locale supports any character outside the BMP, then UTF-16 could be considered conformant.)
