How to check if a text file is encoded in UTF-8?

How can I check whether a text file is encoded in UTF-8 in C++?

+4
4 answers

Try this article to see if it helps.

+5

Try reading it as UTF-8 and check whether the encoding is broken; if it is not, check that it decodes only to valid Unicode code points.

But there is still no guarantee that the file is in UTF-8, ASCII, or anything else. How would you interpret a file containing a single byte, the letter A? ASCII? UTF-8? Something else? Likewise, what if a file happens to start with a BOM by pure chance but is not really UTF-8, or was never intended to be UTF-8?
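
To illustrate why a BOM is only a hint, here is a minimal sketch of a BOM check. The function name `has_utf8_bom` is made up for this example, and it assumes the file contents have already been read into a `std::string`. The same three bytes, EF BB BF, are also perfectly legal in single-byte encodings such as Latin-1, so a `true` result does not prove the file is UTF-8.

```cpp
#include <string>

// Hypothetical helper: returns true if the data starts with the bytes
// EF BB BF, the UTF-8 encoding of U+FEFF (the byte order mark).
// This is a hint only; it says nothing about the rest of the data.
bool has_utf8_bom(const std::string& data) {
    return data.size() >= 3 &&
           static_cast<unsigned char>(data[0]) == 0xEF &&
           static_cast<unsigned char>(data[1]) == 0xBB &&
           static_cast<unsigned char>(data[2]) == 0xBF;
}
```

A file with a BOM should still be validated in full, and a file without one may well be UTF-8.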

This article may be of interest.

+4

You cannot know for sure that any piece of binary data was intended to represent UTF-8. However, you can always check if it can be interpreted as UTF-8. The easiest way is to simply try and convert it (say, to UTF-32) and see if there are any errors. If all you need is a check, then you can do the same without actually writing the output. (You will need to write this yourself, but it's easy.)

Please note that for security it is extremely important to abort the conversion completely on the first error and not try to "recover" somehow.
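
A check-only decoder along these lines might look like the sketch below. It is one possible way to do it, not a reference implementation: the name `is_valid_utf8` is made up for this example, and it assumes the whole file has already been read into a `std::string`. It decodes each sequence without storing the result, returns `false` on the first error, and rejects overlong forms, surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns true if `data` can be interpreted as
// well-formed UTF-8. Stops at the first error; never tries to recover.
bool is_valid_utf8(const std::string& data) {
    const auto* p = reinterpret_cast<const unsigned char*>(data.data());
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t len;       // total length of this sequence in bytes
        std::uint32_t cp;      // code point being decoded
        std::uint32_t min_cp;  // smallest code point allowed for this length
        if (b < 0x80)                { len = 1; cp = b;        min_cp = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min_cp = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (len > n - i) return false;  // sequence is cut off at end of data
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

A `true` result only means the bytes can be read as UTF-8. Pure ASCII input, for example, always passes.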

+4

Try converting it to UTF-16. If you do not get any errors, the file is most likely UTF-8. But no matter what you do, this is still only a best guess.

0

Source: https://habr.com/ru/post/1388273/

