How to check whether a file is valid UTF-8?

I am processing some data files that are supposed to be valid UTF-8 but are not, which causes the parser (not under my control) to fail. I would like to add a preliminary step that validates the data as well-formed UTF-8, but I have not yet found a utility to help do this.

There is a web service at W3C that appears to be dead, and I have found a Windows-only validation tool that reports invalid UTF-8 files but does not report which lines/characters need to be fixed.

I would be happy with either a tool that I can drop in and use (ideally cross-platform), or a Ruby/Perl script that I can make part of my data loading process.

+62
validation internationalization utf-8
Sep 22 '08 at 14:39
5 answers

You can use GNU iconv:

$ iconv -f UTF-8 your_file -o /dev/null; echo $? 

Or with older versions of iconv, such as on macOS:

 $ iconv -f UTF-8 your_file > /dev/null; echo $? 

The command returns 0 if the file could be converted successfully, and 1 if not. It also prints the byte offset at which the invalid byte sequence occurred.

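Edit: the output encoding does not need to be specified. If -t is omitted, iconv uses the encoding of the current locale, which is UTF-8 on most modern systems; pass -t UTF-8 explicitly if the locale may differ.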

+83
Sep 22 '08 at 14:48

Use Python and the str.encode / str.decode functions.

    >>> a="γεια"
    >>> a
    '\xce\xb3\xce\xb5\xce\xb9\xce\xb1'
    >>> b='\xce\xb3\xce\xb5\xce\xb9\xff\xb1' # note second-to-last char changed
    >>> print b.decode("utf_8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.5/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 6: unexpected code byte

The raised exception carries the requested information (the offending bytes and their position) in its .args attribute.

    >>> try: print b.decode("utf_8")
    ... except UnicodeDecodeError, exc: pass
    ...
    >>> exc
    UnicodeDecodeError('utf8', '\xce\xb3\xce\xb5\xce\xb9\xff\xb1', 6, 7, 'unexpected code byte')
    >>> exc.args
    ('utf8', '\xce\xb3\xce\xb5\xce\xb9\xff\xb1', 6, 7, 'unexpected code byte')
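The session above is Python 2. As a rough sketch of the same idea in Python 3 (an assumption, since the answer itself targets 2.5), the exception's start attribute can be turned into a line and column so that the offending byte is easy to locate. The names find_invalid_utf8 and check_file are made up for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Return None if data is valid UTF-8, else (offset, line, column, byte).

    offset is 0-based; line and column are 1-based, and column counts bytes
    (not characters) from the start of the line.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        offset = exc.start
        # Line number: newlines seen before the bad byte, plus one.
        line = data.count(b"\n", 0, offset) + 1
        # rfind returns -1 when there is no earlier newline, so the
        # line start falls back to index 0.
        line_start = data.rfind(b"\n", 0, offset) + 1
        column = offset - line_start + 1
        return offset, line, column, data[offset]
    return None


def check_file(path: str):
    """Hypothetical helper: validate a whole file read into memory."""
    with open(path, "rb") as fh:
        return find_invalid_utf8(fh.read())
```

This reads the whole file at once, which is probably fine for modest data files; it reports only the first problem it meets.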
+10
Sep 22 '08 at 14:44

You can use isutf8 from the moreutils collection.

 $ apt-get install moreutils
 $ isutf8 your_file

In a shell script, use the --quiet switch and check the exit status, which is zero for files that are valid UTF-8.

+9
May 26 '16 at 2:49

What about the GNU iconv library? Using the iconv() function: "An invalid multibyte sequence is encountered in the input. In this case, it sets errno to EILSEQ and returns (size_t)(-1). *inbuf is left pointing to the beginning of the invalid multibyte sequence."

EDIT: oh, I missed the part where you want a scripting language. But for command-line work, the iconv utility should validate the file for you too.
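For a scripting-language analogue of that behaviour (stop at the first invalid sequence and report where it begins), here is a minimal Python 3 sketch using the standard codecs incremental decoder. It feeds the input chunk by chunk, so a large file need not be held in memory. The function names and the 64 KiB chunk size are assumptions chosen for illustration, not anything from iconv.

```python
import codecs


def first_invalid_offset(chunks):
    """Return the absolute byte offset where the first invalid UTF-8
    sequence begins, or None if the whole stream is valid."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    fed = 0  # total bytes passed to the decoder in earlier chunks
    for chunk in chunks:
        # Bytes of an incomplete sequence carried over from the previous
        # chunk; the exception's start index is relative to these plus
        # the new chunk.
        pending = len(decoder.getstate()[0])
        try:
            decoder.decode(chunk)
        except UnicodeDecodeError as exc:
            return fed - pending + exc.start
        fed += len(chunk)
    # Flush: a sequence truncated at end of input is reported here.
    pending = len(decoder.getstate()[0])
    try:
        decoder.decode(b"", final=True)
    except UnicodeDecodeError as exc:
        return fed - pending + exc.start
    return None


def read_chunks(path, size=64 * 1024):
    """Hypothetical helper: yield a file's contents in fixed-size chunks."""
    with open(path, "rb") as fh:
        yield from iter(lambda: fh.read(size), b"")
```

Typical use would be first_invalid_offset(read_chunks("your_file")); a result of None means the file decoded cleanly.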

+4
Sep 22 '08 at 14:46

The C++ code below is based on a snippet published on many sites across the Internet. I fixed an error in the original code and added the ability to retrieve both the position of the invalid character and the invalid character itself.

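    /// Returns -1 if the string is valid; otherwise returns the position of
    /// the first invalid byte and stores that byte in ch.
    /// Only the lead/continuation byte structure is checked; overlong
    /// encodings and surrogates are not rejected.
    int getInvalidUtf8SymbolPosition(const unsigned char *input, unsigned char &ch)
    {
        int nb = 0; // number of continuation bytes expected after the lead byte
        for (const unsigned char *c = input; *c; c += (nb + 1))
        {
            if (!(*c & 0x80)) nb = 0;              // ASCII
            else if ((*c & 0xe0) == 0xc0) nb = 1;  // 2-byte sequence
            else if ((*c & 0xf0) == 0xe0) nb = 2;  // 3-byte sequence
            else if ((*c & 0xf8) == 0xf0) nb = 3;  // 4-byte sequence
            else
            {
                // Stray continuation byte, or a lead byte for a 5/6-byte
                // form, which RFC 3629 no longer allows.
                ch = *c;
                return static_cast<int>(c - input);
            }
            // Check every continuation byte. A terminating NUL fails the
            // test, so a truncated sequence never reads past the end.
            for (int i = 1; i <= nb; ++i)
            {
                if ((c[i] & 0xc0) != 0x80)
                {
                    ch = c[i];
                    return static_cast<int>(c + i - input);
                }
            }
        }
        return -1;
    }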
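The function above stops at the first invalid byte. Where every offending position is wanted, for example to list all the spots that need fixing, a rough Python 3 sketch can retry the decode from just past each error. The name invalid_utf8_offsets is hypothetical. Slicing on each retry copies data, so this is meant for files with few errors rather than as a tuned implementation.

```python
def invalid_utf8_offsets(data: bytes):
    """Return the 0-based byte offsets at which each invalid UTF-8
    sequence in data begins (empty list if data is valid)."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder is clean
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice.
            offsets.append(pos + exc.start)
            # Resume right after the bytes the decoder rejected.
            pos += exc.end
    return offsets
```

Unlike the C++ function, this relies on Python's decoder, which also rejects overlong encodings and surrogates.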
-3
Oct 31