C++: How to check the byte order mark to detect UTF-8?

I wonder how to inspect the byte order mark of a file in C++ to determine whether it is UTF-8?

+4
3 answers

In general, you cannot.

A byte order mark is a very strong sign that the file you are reading is Unicode. If you expect a text file and the first four bytes you read are:

    0x00, 0x00, 0xfe, 0xff -- The file is almost certainly UTF-32BE
    0xff, 0xfe, 0x00, 0x00 -- The file is almost certainly UTF-32LE
    0xfe, 0xff, XX, XX -- The file is almost certainly UTF-16BE
    0xff, 0xfe, XX, XX (but not 00, 00) -- The file is almost certainly UTF-16LE
    0xef, 0xbb, 0xbf, XX -- The file is almost certainly UTF-8 with a BOM
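The patterns above can be checked mechanically. The following is a minimal sketch, not a definitive implementation: the function name detect_bom and its string return values are made up for illustration. It tests the four-byte UTF-32 marks first, because the UTF-32LE mark starts with the same two bytes as the UTF-16LE mark.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: names the encoding implied by a leading BOM,
// or returns "unknown" when none of the five patterns is present.
// `data` points at the first `size` bytes of the file.
std::string detect_bom(const unsigned char* data, std::size_t size) {
    // UTF-32 marks go first: 0xff, 0xfe, 0x00, 0x00 would otherwise
    // be mistaken for the two-byte UTF-16LE mark.
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) return "UTF-32BE";
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) return "UTF-32LE";
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB &&
        data[2] == 0xBF) return "UTF-8";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "UTF-16BE";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "UTF-16LE";
    return "unknown";
}
```

A result of "unknown" only means that no BOM was found; it says nothing about whether the file is UTF-8.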

But otherwise? If the bytes you read are anything other than one of these five patterns, you cannot say for sure whether or not your file is UTF-8.

In fact, any text document containing only ASCII characters from 0x00 to 0x7f is a valid UTF-8 document and is also a plain ASCII document.

There are heuristics that try to infer whether a document is encoded as, for example, ISO-8859-1, UTF-8, or CP1252, but in general the first two, three, or four bytes of a file are not enough to say whether what you are looking at is definitely UTF-8.
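One common heuristic of this kind is to check whether the whole content is well-formed UTF-8. The sketch below is an illustration under that assumption, and the name is_valid_utf8 is hypothetical. A true result does not prove the file was written as UTF-8 (pure ASCII passes too); it only rules out byte sequences that UTF-8 forbids.

```cpp
#include <cstddef>

// Hypothetical helper: returns true when the n bytes at s are well-formed UTF-8.
bool is_valid_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        std::size_t extra = 0;  // continuation bytes expected after the lead byte
        unsigned long cp = 0;   // code point being decoded
        if (b < 0x80) { ++i; continue; }  // ASCII byte
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0)     { extra = 2; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; }
        else return false;  // stray continuation byte, or 0xC0, 0xC1, 0xF5..0xFF
        if (n - i <= extra) return false;  // sequence cut off at end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;  // must look like 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return false;  // overlong 3-byte form
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogates
        i += extra + 1;
    }
    return true;
}
```

Text in ISO-8859-1 or CP1252 that contains accented letters usually fails this check, because a lone byte such as 0xE9 is not followed by the continuation bytes UTF-8 requires.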

+9

0xEF, 0xBB, 0xBF is the byte sequence of the UTF-8 BOM, regardless of endianness.

How you read the file with C++ is up to you. Personally, I still use the C-style FILE functions because they are provided by the library I code in, and I can be sure of binary mode and avoid unintended newline translation.
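For readers who prefer the C-style route, here is a rough sketch of what it might look like. The helper names read_prefix and starts_with_utf8_bom are invented for this example; only std::fopen, std::fread and std::fclose are standard library calls.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: reads up to `capacity` leading bytes of `path` into
// `out` and returns how many were read (0 if the file cannot be opened).
std::size_t read_prefix(const char* path, unsigned char* out,
                        std::size_t capacity) {
    std::FILE* f = std::fopen(path, "rb");  // "b": no newline translation
    if (!f) return 0;
    std::size_t got = std::fread(out, 1, capacity, f);
    std::fclose(f);
    return got;
}

// Hypothetical helper: true when the file starts with the UTF-8 BOM.
bool starts_with_utf8_bom(const char* path) {
    unsigned char head[3];
    return read_prefix(path, head, sizeof head) == 3 &&
           head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
}
```

Reading into unsigned char keeps the comparisons against 0xEF, 0xBB and 0xBF straightforward on platforms where plain char is signed.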

adapted from cs.vt.edu

    #include <fstream>
    ...
    // unsigned char, so that comparisons against 0xEF etc. also work
    // where plain char is signed
    unsigned char buffer[100];
    std::ifstream myFile("data.bin", std::ios::in | std::ios::binary);
    myFile.read(reinterpret_cast<char*>(buffer), 3);
    if (!myFile) {
        // An error occurred!
        // myFile.gcount() returns the number of bytes read.
        // Calling myFile.clear() will reset the stream state
        // so it is usable again.
    }
    ...
    if (!myFile.read(reinterpret_cast<char*>(buffer), 100)) {
        // Same effect as above
    }
    if (buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF) {
        // Congrats, UTF-8
    }

Alternatively, many formats default to UTF-8 unless another encoding (UTF-16 or UTF-32) is specified.

Wikipedia for the spec

unicode.org FAQ

+4
    if (buffer[0] == '\xEF' && buffer[1] == '\xBB' && buffer[2] == '\xBF') {
        // UTF-8
    }

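When buffer is an array of plain char, it is better to use buffer[0] == '\xEF' instead of buffer[0] == 0xEF to avoid signed / unsigned char problems: where char is signed, buffer[0] holds a negative value that never equals the int 0xEF. See "How to represent negative char values in hexadecimal format?"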
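Another way around the same problem, sketched here with hypothetical helper names (byte_equals, has_utf8_bom), is to convert each char to unsigned char before comparing. This gives the same answer whether plain char is signed or unsigned on the platform.

```cpp
#include <cstddef>

// Hypothetical helper: compares a plain-char byte against a value in 0..255.
// The cast maps a negative char such as '\xEF' back to 239 before comparing.
bool byte_equals(char c, int value) {
    return static_cast<unsigned char>(c) == value;
}

// Hypothetical helper: true when the first three bytes are the UTF-8 BOM.
bool has_utf8_bom(const char* buffer, std::size_t size) {
    return size >= 3 &&
           byte_equals(buffer[0], 0xEF) &&
           byte_equals(buffer[1], 0xBB) &&
           byte_equals(buffer[2], 0xBF);
}
```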

+3

Source: https://habr.com/ru/post/1394268/

