Which encoding looks exactly like ASCII, but has NULL bytes before each byte?

Question

I have a string that looks and behaves as follows (Python session below). WTF?! What is the encoding?

>>> s = u'\x00Q\x00u\x00i\x00c\x00k'
>>> print s
Quick
>>>
>>> s == 'Quick'
False
>>>
>>> import re
>>> re.search('Quick', s)
>>>
>>> import chardet
>>> chardet.detect(s)
/usr/lib/pymodules/python2.6/chardet/universaldetector.py:69: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if aBuf[:3] == '\xEF\xBB\xBF':
/usr/lib/pymodules/python2.6/chardet/universaldetector.py:72: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif aBuf[:4] == '\xFF\xFE\x00\x00':
/usr/lib/pymodules/python2.6/chardet/universaldetector.py:75: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif aBuf[:4] == '\x00\x00\xFE\xFF':
/usr/lib/pymodules/python2.6/chardet/universaldetector.py:78: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif aBuf[:4] == '\xFE\xFF\x00\x00':
/usr/lib/pymodules/python2.6/chardet/universaldetector.py:81: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif aBuf[:4] == '\x00\x00\xFF\xFE':
/usr/lib/pymodules/python2.6/chardet/universaldetector.py:84: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif aBuf[:2] == '\xFF\xFE':
/usr/lib/pymodules/python2.6/chardet/universaldetector.py:87: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif aBuf[:2] == '\xFE\xFF':
{'confidence': 1.0, 'encoding': 'ascii'}
>>>
>>> chardet.detect(s)
{'confidence': 1.0, 'encoding': 'ascii'}
>>>

python character-encoding

ibz 21 sept '10 at 10:02

2 answers

You have UTF-16BE without a BOM (byte order mark). As the chardet documentation states, chardet cannot detect UTF-nnxE encodings without a BOM.

>>> s = '\x00Q\x00u\x00i\x00c\x00k' #### Note: dropping the spurious `u` prefix
>>> s.decode('utf_16be')
u'Quick'
>>>

chardet is also not smart enough to raise a DontBeSilly exception if you feed it unicode :-)
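
If the data has already ended up in a text string with the NULs still embedded, as in the question, the bytes can usually be recovered before decoding. Here is a minimal Python 3 sketch of that idea (the session above is Python 2). The helper name `recover_utf16be` is made up for illustration and is not part of chardet, and the sketch assumes the string was produced by a one-byte-per-character decode, so that Latin-1 maps each character back to its original byte.

```python
# Python 3 sketch. recover_utf16be is a hypothetical helper, not a chardet API.
def recover_utf16be(text_or_bytes):
    """Decode BOM-less UTF-16BE data, even if it was already mis-decoded to text."""
    if isinstance(text_or_bytes, str):
        # Assumption: every character is below U+0100, so Latin-1 turns
        # each character back into the single byte it came from.
        raw = text_or_bytes.encode("latin-1")
    else:
        raw = bytes(text_or_bytes)
    # The byte order is given explicitly because there is no BOM to consult.
    return raw.decode("utf-16-be")


mangled = "\x00Q\x00u\x00i\x00c\x00k"   # the string from the question
print(recover_utf16be(mangled))                                    # Quick
print(recover_utf16be(b"\x00Q\x00u\x00i\x00c\x00k") == "Quick")    # True
```

Once decoded this way, the comparison and the `re.search` from the question should behave as expected, because the string no longer contains the NUL characters.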

John Machin 21 sept '10 at 10:16

Delan Azabani · Accepted Answer · 2010-09-21T10:03:54+0000

UTF-16 big endian
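
One way to convince yourself is to encode a known ASCII string and look at the bytes. This is a small Python 3 sketch, not something from the original answer; the variable names are only illustrative.

```python
# Python 3 sketch: compare the byte layouts of the UTF-16 variants.
text = "Quick"

big = text.encode("utf-16-be")      # high byte first: a NUL before each ASCII byte
little = text.encode("utf-16-le")   # low byte first: a NUL after each ASCII byte

print(big)      # b'\x00Q\x00u\x00i\x00c\x00k'
print(little)   # b'Q\x00u\x00i\x00c\x00k\x00'

# The plain "utf-16" codec prepends a BOM; which one depends on the platform.
with_bom = text.encode("utf-16")
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))   # True
```

The big-endian output matches the byte pattern in the question, while little-endian would put the NUL after each character instead.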
