Is unicode(codecs.BOM_UTF8, "utf8") necessary in Python 2.7 / 3?

In the code review, I came across the following code:

    # Python bug that renders the unicode identifier (0xEF 0xBB 0xBF)
    # as a character.
    # If untreated, it can prevent the page from validating or rendering
    # properly.
    bom = unicode(codecs.BOM_UTF8, "utf8")
    r = r.replace(bom, '')

The code lives in a function that passes a string to a Response object (Django or Flask).

Is this still a bug that needs this workaround in Python 2.7 or 3? Something tells me it is not, but I thought I would ask because I do not know this problem very well.

I'm not sure where this snippet came from, but I have seen it around the Internet, sometimes in connection with Jinja2 (which we use).

Thank you for reading.

2 answers

The Unicode standard says that U+FEFF has two different meanings: at the start of a data stream it serves as a byte-order mark and/or encoding signature, while anywhere else it is interpreted as a zero-width no-break space.

So the code

    bom = unicode(codecs.BOM_UTF8, "utf8")
    r = r.replace(bom, '')

does not only remove the UTF-8 encoding signature (a.k.a. BOM) from the start of the string; it also removes any embedded zero-width no-break spaces.
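This behaviour can be sketched in Python 3, where every `str` is already Unicode (the sample text is made up for illustration):

```python
import codecs

# The same character the original snippet computes with
# unicode(codecs.BOM_UTF8, "utf8") in Python 2.
bom = codecs.BOM_UTF8.decode("utf-8")  # '\ufeff'

# A leading BOM plus an embedded zero-width no-break space.
text = "\ufeffHello\ufeff world"

# replace() removes *every* occurrence, not just the leading one.
cleaned = text.replace(bom, "")
print(cleaned)  # -> Hello world
```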

Earlier versions of Python had no variant of the "utf-8" codec that skips the BOM when reading a data stream. Since that was inconsistent with the other Unicode codecs, the utf-8-sig codec, which does skip the BOM, was introduced in Python 2.5.

So it is possible that the "Python bug" mentioned in the code comment refers to this.

However, it is more likely that the "bug" refers to embedded \ufeff characters. Since the Unicode standard explicitly allows them as legal characters, it is up to the consumer of the data to decide how to handle them, so this is not a bug in Python.
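The difference between the two codecs is easy to demonstrate (Python 3 syntax):

```python
import codecs

# UTF-8 bytes with a leading BOM, as a file saved by e.g. Notepad might start.
data = codecs.BOM_UTF8 + b"hello"

# The plain utf-8 codec keeps the BOM as a visible \ufeff character...
print(repr(data.decode("utf-8")))      # '\ufeffhello'

# ...while utf-8-sig (available since Python 2.5) strips it.
print(repr(data.decode("utf-8-sig")))  # 'hello'
```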


A BOM (byte order mark) is a byte sequence at the start of a data stream that indicates which Unicode encoding is used.

The BOM is used to tell the decoder how to convert the bytes to Unicode (since the same Unicode text can have different binary representations).

It makes no sense to try to put a BOM inside an already-decoded Unicode string.
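In other words, the BOM belongs at the byte level and is produced by the encoder, as this short sketch shows:

```python
# Encoding with utf-8-sig prepends the 3-byte UTF-8 BOM (0xEF 0xBB 0xBF)
# to the resulting bytes; the Unicode string itself never contains it.
text = "hello"
encoded = text.encode("utf-8-sig")
print(encoded)  # -> b'\xef\xbb\xbfhello'
```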


Source: https://habr.com/ru/post/901304/
