Detect text file encoding

My program loads the text files provided by the user:

    QFile file(fileName);
    file.open(QIODevice::ReadOnly);
    QTextStream stream(&file);
    const QString &text = stream.readAll();

This works well when the files are encoded in UTF-8, but some users try to import files encoded in Windows-1252, and words with special characters (for example, "è" in "boutonnière") are then displayed incorrectly.

Is there a way to detect the encoding, or at least to distinguish between UTF-8 (possibly without a BOM) and Windows-1252, without asking the user to tell me the encoding?

2 answers

It turns out that automatic encoding detection is not possible for the general case.

However, there is a workaround that at least falls back to the system locale when the text is not recognized as UTF-8, UTF-16, or UTF-32. It uses QTextCodec::codecForUtfText(), which tries to detect UTF-8, UTF-16, or UTF-32 from the byte array and returns the supplied default codec if detection fails.

Code for this:

    QTextCodec *codec = QTextCodec::codecForUtfText(byteArray, QTextCodec::codecForName("System"));
    const QString &text = codec->toUnicode(byteArray);

Update

However, the above code will not detect UTF-8 without a BOM, since codecForUtfText() relies on the byte order mark. To detect UTF-8 without a BOM, see fooobar.com/questions/541759 / ....
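To illustrate why BOM-based detection misses plain UTF-8 files, here is a minimal sketch in standard C++ of what a BOM check amounts to. The helper name encodingFromBom is hypothetical and is not part of Qt; this only approximates the kind of check described above and is not Qt's actual implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Qt API): names the encoding implied by a leading
// byte order mark, or returns an empty string if the data has no BOM.
std::string encodingFromBom(const std::string &bytes)
{
    auto startsWith = [&bytes](const char *prefix, std::size_t length) {
        return bytes.compare(0, length, prefix, length) == 0;
    };
    // UTF-32 LE must be tested before UTF-16 LE: both begin with FF FE.
    if (startsWith("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (startsWith("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (startsWith("\xEF\xBB\xBF", 3)) return "UTF-8";
    if (startsWith("\xFF\xFE", 2)) return "UTF-16LE";
    if (startsWith("\xFE\xFF", 2)) return "UTF-16BE";
    return "";  // no BOM: the encoding cannot be determined this way
}
```

A UTF-8 file saved without a BOM starts directly with its text, so a check like this returns an empty result and the caller ends up with the fallback codec.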


This trick works for me, at least so far. This method does not require a BOM to work:

    QTextCodec::ConverterState state;
    QTextCodec *codec = QTextCodec::codecForName("UTF-8");
    const QByteArray data(readSource());
    const QString text = codec->toUnicode(data.constData(), data.size(), &state);
    if (state.invalidChars > 0) {
        // Not valid UTF-8: fall back to the system default locale codec
        QTextCodec *localeCodec = QTextCodec::codecForLocale();
        if (!localeCodec)
            return;
        // Reuse the bytes already read instead of reading the source again
        ui->textBrowser->setPlainText(localeCodec->toUnicode(data));
    } else {
        ui->textBrowser->setPlainText(text);
    }
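The idea behind this trick (try UTF-8 first, and treat any invalid byte sequence as a sign of a legacy encoding) can also be sketched without Qt. The following is a rough illustration in standard C++. The names isValidUtf8 and guessEncoding are hypothetical, and returning "Windows-1252" for every non-UTF-8 input is an assumption taken from the question, not something the bytes themselves prove.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Qt API): true if `bytes` is well-formed UTF-8.
bool isValidUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;        // continuation bytes expected after the lead byte
        unsigned long codePoint = 0;
        if (lead < 0x80)                { extra = 0; codePoint = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; codePoint = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; codePoint = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; codePoint = lead & 0x07; }
        else return false;            // stray continuation byte or invalid lead byte
        if (extra > n - i - 1) return false;   // sequence is cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            codePoint = (codePoint << 6) | (c & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF.
        static const unsigned long minimum[] = {0x0, 0x80, 0x800, 0x10000};
        if (codePoint < minimum[extra]) return false;
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return false;
        if (codePoint > 0x10FFFF) return false;
        i += extra + 1;
    }
    return true;
}

// Assumption: anything that is not valid UTF-8 is treated as Windows-1252.
std::string guessEncoding(const std::string &bytes)
{
    return isValidUtf8(bytes) ? "UTF-8" : "Windows-1252";
}
```

This works reasonably well in practice because an accented character in Windows-1252 is a single byte of 0x80 or above followed by an ordinary letter, which is not a valid UTF-8 sequence. It is still a heuristic: pure ASCII text is valid in both encodings, and other single-byte encodings cannot be told apart from Windows-1252 this way.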

Source: https://habr.com/ru/post/1496979/

