Unable to match Python Unicode comparison

This question is related to Unicode Character Search in Python

I am reading a Unicode text file using Python's codecs module:

codecs.open('story.txt', 'rb', 'utf-8-sig') 

I then tried to search for specific lines in it, but I get the following warning:

 UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal 

Is there any special way to compare strings in Unicode?

+42
python unicode
Aug 12 '13 at 17:43
1 answer

You can use the == operator to compare Unicode objects for equality.

 >>> s1 = u'Hello'
 >>> s2 = unicode("Hello")
 >>> type(s1), type(s2)
 (<type 'unicode'>, <type 'unicode'>)
 >>> s1 == s2
 True
 >>>
 >>> s3 = 'Hello'.decode('utf-8')
 >>> type(s3)
 <type 'unicode'>
 >>> s1 == s3
 True

But your error message indicates that you are not comparing unicode objects. You are probably comparing a unicode object with a str object, for example:

 >>> u'Hello' == 'Hello'
 True
 >>> u'Hello' == '\x81\x01'
 __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
 False

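Note that the second comparison pits a unicode object against a str whose bytes cannot be decoded. Python 2 implicitly decodes the str with the default codec (ASCII unless it has been changed) before comparing, and '\x81\x01' is neither valid ASCII nor valid UTF-8.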

Your program, I suppose, compares unicode objects with str objects, and the contents of the str object cannot be decoded implicitly (most likely they are encoded bytes containing non-ASCII characters). This seems like a likely result of the fact that you (the programmer) do not know which variable contains unicode, which variable contains UTF-8 and which variable contains bytes read from the file.
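One way to take the guesswork out of such a comparison is to decode any byte string explicitly before comparing, instead of relying on the implicit conversion. The following is only a minimal sketch: `to_text` and `same_text` are hypothetical helper names, and it assumes the bytes are UTF-8 encoded, which may not be true of your data.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a text (unicode) object.

    Byte strings are decoded explicitly with `encoding` (assumed UTF-8 here);
    text objects are returned unchanged. `bytes` is an alias of `str` on
    Python 2, so the check works on both Python 2 and Python 3.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def same_text(a, b, encoding="utf-8"):
    """Compare two values after converting both to text."""
    return to_text(a, encoding) == to_text(b, encoding)
```

With this, `same_text(u'caf\xe9', b'caf\xc3\xa9')` compares two unicode objects. If the bytes are not valid in the assumed encoding, `decode` raises a UnicodeDecodeError, so the problem surfaces where it happens rather than as a warning and a silent False.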

I recommend http://nedbatchelder.com/text/unipain.html, especially the advice on creating a Unicode sandwich.
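As I understand that advice, the idea is to decode bytes to unicode as soon as they enter the program, work only with unicode in the middle, and encode back to bytes only on output. Below is a rough sketch of what that could look like for searching a file. The function names `matching_lines` and `save_lines` are made up for illustration, and the 'utf-8-sig' default is taken from the question.

```python
import io


def matching_lines(path, needle, encoding="utf-8-sig"):
    """Return the lines of the file at `path` that contain `needle`.

    `needle` is expected to be a text (unicode) object.
    """
    # Input edge: io.open decodes the bytes to text while reading.
    with io.open(path, "r", encoding=encoding) as f:
        lines = [line.rstrip(u"\r\n") for line in f]
    # Middle: every value here is text, so no str/unicode mixing can occur.
    return [line for line in lines if needle in line]


def save_lines(path, lines, encoding="utf-8"):
    """Write text lines to `path`, encoding them only at the output edge."""
    with io.open(path, "w", encoding=encoding) as f:
        for line in lines:
            f.write(line + u"\n")
```

A call such as `matching_lines('story.txt', u'Hello')` then compares unicode with unicode throughout, which should avoid the warning from the question.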

+60
Aug 12 '13 at 18:43