I need to check that the buffer contains valid UTF-8 data.
In Python, I can do this simply by trying to decode the bytes and checking for an exception. In the example below, I try to decode only the first byte of the UTF-8 encoding of "¢" (c2 a2). The exception tells me that the sequence is truncated.
    Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> s = '¢'
    >>> s_bytes = s.encode()
    >>> s_bytes[:1].decode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: unexpected end of data
This approach does not work in Node.js because its decoding is much more forgiving: the incomplete sequence is replaced with a placeholder character instead of raising an error.
    > s = '¢'
    '¢'
    > s_buffer = Buffer(s)
    <Buffer c2 a2>
    > s_buffer.toString('utf8', 0, 1)
    '?'
I checked the Buffer API page, but I cannot find any method for validating a buffer against an encoding.
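One workaround that follows from the forgiving behaviour shown above is a round-trip comparison: decode the buffer, re-encode the resulting string, and compare the bytes with the original. Since invalid sequences are replaced rather than preserved, any mismatch should indicate invalid UTF-8. This is only a minimal sketch under that assumption. `isValidUtf8` is a hypothetical helper name, not part of the Buffer API, and the sketch uses `Buffer.from`, which assumes a newer Node.js than the `Buffer(s)` call in my session.

```javascript
// Hypothetical helper (not a built-in): checks a buffer by round-tripping it
// through a UTF-8 decode and re-encode.
function isValidUtf8(buf) {
  // Invalid or truncated sequences are decoded to U+FFFD here...
  const decoded = buf.toString('utf8');
  // ...and U+FFFD re-encodes as ef bf bd, which will not match the bad bytes.
  const reencoded = Buffer.from(decoded, 'utf8');
  return reencoded.equals(buf);
}

const whole = Buffer.from([0xc2, 0xa2]); // complete encoding of '¢'
const truncated = whole.slice(0, 1);     // only the lead byte c2

console.log(isValidUtf8(whole));     // true
console.log(isValidUtf8(truncated)); // false
```

The cost is one extra copy of the data, so this may not suit very large buffers. It also only reports whether the buffer is valid, not the position of the first bad byte that the Python exception gives.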