Unicode vs Bytes
Firstly, some terminology. There are two types of strings, encoded and decoded:
- Encoded. This is what gets stored on disk. To Python, it is a bunch of 0s and 1s that you might treat as ASCII, but it could be anything: binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it is more accurately called a "bytes" variable.
- Decoded. This is a string of actual characters. They could be encoded as one-byte-per-character ASCII, or as a multi-byte encoding that covers Chinese characters. But until it is converted into an encoded variable, it is just a Unicode string of characters.
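To make the distinction concrete, here is a minimal sketch. The sample text is my own, and the names are only for illustration; the u"" and b"" prefixes keep it valid on both Python 2.7 and 3.x.

```python
# Decoded: a string of characters (unicode in Python 2, str in Python 3).
text = u"Fran\u00e7ais"   # the word "Français"

# Encoded: the same characters turned into bytes under a given encoding.
as_latin1 = text.encode("latin1")
as_utf8 = text.encode("utf-8")

# The same characters become different bytes depending on the encoding.
print(repr(as_latin1))
print(repr(as_utf8))

# Decoding with the matching encoding gives the original characters back.
assert as_latin1.decode("latin1") == text
assert as_utf8.decode("utf-8") == text
```

Note that the two byte strings differ even though they stand for the same text, which is why the bytes alone do not tell you what characters they represent.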
What does this mean for you
So here's the thing. You said you have one ASCII variable and one Unicode variable. This is actually not the case.
- You have one variable containing a string of bytes - ones and zeros, presumably in sets of 8. This is the variable that you incorrectly assumed was ASCII.
- You have another variable that contains Unicode data - numbers, letters, and symbols.
Before you can compare a byte string with a Unicode character string, you have to make an assumption about how the bytes are encoded. In your case, Python (and you) assumed that the byte string was ASCII encoded. This worked fine until you came across a character that was not ASCII - a character with an accent.
So, you need to find out how this string of bytes is encoded. It could be latin1. If so, you want to do this:
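A small sketch of where the ASCII assumption breaks down. The byte string here is a made-up example ("café" encoded as latin1), not your actual data.

```python
raw = b"caf\xe9"   # hypothetical input: "café" encoded as latin1

# Pure ASCII bytes decode without complaint.
assert b"cafe".decode("ascii") == u"cafe"

# A byte above 0x7F is not valid ASCII, so the decode raises.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(ascii_ok)
```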
if unicode_variable == string_variable.decode('latin1'):
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, this is all you need to do. But if your byte string is encoded in something else, you need to find out which encoding it is and pass that to decode().
The bottom line is that there is no simple answer unless you know (or make some assumptions about) the encoding of your input.
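As a rough sketch of why the encoding you pass matters: same_text below is a hypothetical helper of mine that wraps the comparison, and the sample bytes are invented for illustration.

```python
def same_text(unicode_variable, string_variable, encoding="latin1"):
    """Decode the byte string with the given encoding, then compare."""
    return unicode_variable == string_variable.decode(encoding)

wanted = u"caf\u00e9"            # "café"
latin1_bytes = b"caf\xe9"        # "café" encoded as latin1
utf8_bytes = b"caf\xc3\xa9"      # "café" encoded as utf-8

print(same_text(wanted, latin1_bytes))          # right encoding
print(same_text(wanted, utf8_bytes))            # wrong encoding, silent mismatch
print(same_text(wanted, utf8_bytes, "utf-8"))   # right encoding again
```

The middle call does not raise. It just decodes to the wrong characters, so the comparison quietly comes out false.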
What I would do
Try running var.decode('latin1') on the byte string. This will give you a Unicode variable. If this works and the data looks correct (i.e., characters with accent marks look like they belong), go with it.
Oh, and if the latin1 result doesn't look right, try utf8 - another common encoding.
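If you want to automate the guess, something like the following might do. decode_guess is a hypothetical helper, not a library function. It tries utf-8 before latin1 because a latin1 decode accepts any byte sequence and so never raises, while a utf-8 decode does raise on bytes that are not valid utf-8.

```python
def decode_guess(data, encodings=("utf-8", "latin1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

# Invented sample inputs: "café" in each of the two encodings.
print(decode_guess(b"caf\xc3\xa9"))
print(decode_guess(b"caf\xe9"))
```

This is still only a guess. A clean decode does not prove the encoding is right, so you should still look at the result and check that the accented characters look like they belong.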