The Unicode standard indicates that the \ufeff has two different meanings. At the beginning of the data stream, it should be used as the signature of the byte and / or encoding, but in another place it should be interpreted as an inextricable space of zero width.
So the code
bom = unicode(codecs.BOM_UTF8, "utf8" ) r = r.replace(bom, '')
Not only does it remove the utf-8 encoding signature (aka BOM) - it also removes any embedded zero-width blanks.
In some earlier versions of python there was no version of the "utf-8" codec, which skips the specification when reading data streams. Since this was not compatible with other Unicode codecs, the utf-8-sig codec was introduced with version 2.5 , which skips the specification.
Thus, it is possible that the βPython errorβ mentioned in the code comments refers to this.
However, it is more likely that the "error" refers to the embedded \ufeff . But since the Unicode standard clearly states that they can be interpreted as legal characters, in fact the data consumer has to decide how to process them, and therefore is not an error in python.
source share