Python String Comparison - Special / Unicode Character Issues

I am writing a Python script to process some music data. It is supposed to merge two separate databases, comparing their records and matching them up. It almost works, but it fails when comparing strings that contain special characters (for example, accented letters). I am fairly sure this is an ASCII vs. Unicode encoding problem, as I get the error:

"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"

I understand that I could use regular expressions to remove the offending characters, but I process a lot of data, and relying too heavily on regular expressions makes my program a bit slow. Is there a way to get Python to compare these strings properly? What is going on here - is there a way to tell whether it is storing my strings as ASCII or Unicode?

EDIT 1: I am using Python v2.6.6. After checking the types, I found that one database spits out Unicode strings and the other gives ASCII. So that is probably the problem. I am trying to convert the ASCII strings from the second database to Unicode using a line like

line = unicode(f.readline().decode(latin_1).encode(utf_8)) 

but this gives an error, for example:

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128) 

I am not sure why the 'ascii' codec is complaining since I am trying to decode ASCII. Can anyone help?

4 answers

Unicode vs Bytes

Firstly, some terminology. There are two types of strings, encoded and decoded:

  • Encoded. This is what is stored on disk. To Python, it is a bunch of 0s and 1s that you might treat as ASCII, but it could be anything - binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it is more accurately called a "bytes" variable.
  • Decoded. This is a string of actual characters. They can be encoded to 8-bit ASCII strings, or to 32-bit Chinese characters. But until it is converted to an encoded variable, it is just a Unicode string of characters.
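As a rough illustration of the two kinds of string, the sketch below encodes one Unicode string into two different byte sequences and decodes them back. The sample word is made up, and the `b''` and `u''` literals are used only so that the snippet behaves the same on Python 2.6+ and Python 3.

```python
# A decoded (Unicode) string: a sequence of characters.
text = u'caf\xe9'  # "café"; \xe9 is the code point for e-acute

# Encoding turns characters into bytes; the bytes depend on the codec used.
as_latin1 = text.encode('latin1')  # one byte for the accented letter
as_utf8 = text.encode('utf8')      # two bytes for the accented letter

# Decoding with the matching codec recovers the original characters.
round_trip_latin1 = as_latin1.decode('latin1')
round_trip_utf8 = as_utf8.decode('utf8')
```

The same characters end up as different bytes on disk, which is why a byte string alone does not tell you what text it holds.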

What does this mean for you

So here is the thing. You said you get one ASCII variable and one Unicode variable. That is not quite accurate.

  • You have one variable containing a string of bytes - ones and zeros, presumably in sets of 8. This is the variable that you assumed, incorrectly, to be ASCII.
  • You have another variable that contains Unicode data - numbers, letters, and symbols.

Before a byte string can be compared with a Unicode character string, some assumption has to be made about how the bytes are encoded. In your case, Python (and you) assumed that the byte string was ASCII encoded. That worked fine until you came across a character that was not ASCII - a character with an accent.
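The implicit ASCII assumption can be reproduced by hand. This sketch (the sample bytes are invented) decodes a UTF-8 byte string as ASCII, which is roughly what Python 2 attempts behind the scenes when a str is compared with a unicode, and catches the resulting error. Byte 0xc3 is the same value reported in the traceback in the question.

```python
raw = b'Beyonc\xc3\xa9'  # UTF-8 bytes for "Beyoncé" (made-up sample record)

try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the ASCII codec rejected.
    ascii_ok = False
    bad_byte_position = exc.start
```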

So, you need to find out how this string of bytes is encoded. It could be latin1. If so, you want to do this:

 if unicode_variable == string_variable.decode('latin1'):
     pass  # the two values represent the same text

Latin1 is basically ASCII plus some extended characters like Ç and Â.

If your data is in Latin1, this is all you need to do. But if your byte string is encoded in something else, you need to find out which encoding that is and pass it to decode().
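A minimal sketch of that comparison, wrapped in a helper so the encoding is stated in one place. The function name `same_text` and the sample values are hypothetical; the latin1 default simply mirrors the guess made above.

```python
def same_text(unicode_value, byte_value, encoding='latin1'):
    """Compare a Unicode string with a byte string of known encoding."""
    # Decode the bytes explicitly instead of relying on the ASCII default.
    return unicode_value == byte_value.decode(encoding)


# The same name stored as Latin-1 bytes and as UTF-8 bytes.
name = u'Bj\xf6rk'
latin1_bytes = b'Bj\xf6rk'
utf8_bytes = b'Bj\xc3\xb6rk'

matches_latin1 = same_text(name, latin1_bytes)
matches_utf8 = same_text(name, utf8_bytes, encoding='utf8')
wrong_guess = same_text(name, utf8_bytes)  # UTF-8 bytes read as Latin-1
```

Decoding with the wrong codec does not necessarily raise an error; it may just produce different characters, so the comparison quietly comes out unequal.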

The bottom line is that there is no simple answer unless you know (or make some assumptions about) the encoding of your input data.

What I would do

Try running var.decode('latin1') on the byte string. This will give you a Unicode variable. If that works and the data looks correct (i.e., characters with accent marks look like they belong), roll with it.

Oh, and if latin1 doesn't parse or doesn't look right, try utf8 - another common encoding.
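One possible way to automate that trial and error is sketched below. Note that it swaps the order suggested above: latin1 assigns a character to every possible byte, so decoding with it never raises an error and can only be checked by eye, whereas utf8 is strict and rejects most non-UTF-8 input. Trying utf8 first and falling back to latin1 is a heuristic, not a guarantee, and the helper name `to_unicode` is made up for this example.

```python
def to_unicode(byte_value, encodings=('utf8', 'latin1')):
    """Decode bytes with the first candidate encoding that accepts them."""
    for encoding in encodings:
        try:
            return byte_value.decode(encoding)
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes; try the next one
    raise ValueError('none of the candidate encodings fit')


from_utf8 = to_unicode(b'caf\xc3\xa9')  # valid UTF-8, decoded as such
from_latin1 = to_unicode(b'caf\xe9')    # invalid UTF-8, falls back to latin1
```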


You may need to pre-process the databases and convert everything to UTF-8. I assume that you have Latin-1 accented characters in some entries.


As for your question, the only way to know for sure is to look. Have your script spit out the records that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
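A small sketch of what "looking" might involve: listing the position and hex code of every byte outside the ASCII range in a record that failed to compare. The helper name `describe_non_ascii` and the sample record are hypothetical.

```python
def describe_non_ascii(byte_value):
    """Return (position, hex code) pairs for every byte above 127."""
    # bytearray yields integers on both Python 2.6+ and Python 3.
    return [(index, hex(byte))
            for index, byte in enumerate(bytearray(byte_value))
            if byte > 127]


suspect = b'caf\xc3\xa9'  # made-up record that failed to compare
codes = describe_non_ascii(suspect)
```

A 0xc3 byte followed by another high byte, as here, is typical of UTF-8 encoded accented Latin letters, while a single isolated high byte such as 0xe9 points more towards Latin-1.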

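Converting both to unicode should help (for byte strings with non-ASCII characters, pass the encoding explicitly, e.g. unicode(str1, 'latin1'), because the default codec is ASCII):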

 if unicode(str1) == unicode(str2): print "same" 

To find out whether you (not it) are storing your strings as str or unicode objects, print type(your_string) .

You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
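A sketch of that inspection, written so that it runs on both Python 2.6+ and Python 3. The helper name `kind` and the sample values are assumptions made for illustration.

```python
def kind(value):
    """Classify a value as a byte string or a text (Unicode) string."""
    # On Python 2, bytes is an alias for str, so str objects match here.
    return 'bytes' if isinstance(value, bytes) else 'text'


samples = [b'caf\xc3\xa9', u'caf\xe9']  # made-up values from each database
kinds = [kind(value) for value in samples]

for value in samples:
    print(type(value))
    print(repr(value))  # shows escapes instead of possibly garbled glyphs
```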

By the way, exactly what version of Python are you using on which OS? If Python 3.x, use ascii() instead of repr() .


Source: https://habr.com/ru/post/1342757/

