Python 2.7 encoding decoding

Question


I have a problem with encoding/decoding. I read text from a file and compare it with text from a database (Postgres). The comparison is done between two lists.

From the file I get "jo\x9a" for "još", and from the database I get "jo\xc5\xa1" for the same value.

 common = [a for a in codes_from_file if a in kode_prfoksov]
 # Items in one but not the other
 only1 = [a for a in codes_from_file if a not in kode_prfoksov]
 # Items only in the other
 only2 = [a for a in kode_prfoksov if a not in codes_from_file]

How can I solve this? Which encoding should I use when comparing these two lists?

Thank you


python encoding compare

Yebach Mar 21 '12 at 9:41


2 answers

The first seems to be windows-1250, and the second is utf-8.

 >>> print 'jo\x9a'.decode('windows-1250')
 još
 >>> print 'jo\xc5\xa1'.decode('utf-8')
 još
 >>> 'jo\x9a'.decode('windows-1250') == 'jo\xc5\xa1'.decode('utf-8')
 True
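If it is not known in advance which source uses which encoding, one heuristic is to try a strict UTF-8 decode first and fall back to windows-1250 only when that fails. The sketch below assumes those two candidates are the only ones in play; `decode_with_fallback` is a hypothetical helper name, not part of any library. UTF-8 goes first because windows-1250 accepts almost any byte sequence and would silently produce wrong characters for UTF-8 input.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'windows-1250')):
    """Return raw bytes decoded with the first encoding that succeeds.

    Hypothetical helper: the candidate list is an assumption based on
    the two encodings identified above, and the order matters.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError('could not decode %r with any of %r' % (raw, encodings))


from_file = decode_with_fallback(b'jo\x9a')      # not valid UTF-8, falls back
from_db = decode_with_fallback(b'jo\xc5\xa1')    # valid UTF-8
same = (from_file == from_db)
```

This is only a guess at the encoding. When the source of each string is known, as it is here, decoding each list with its own fixed encoding is more reliable.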


stranac Mar 21 '12 at 9:59


jofel · Accepted Answer · 2012-03-21T10:00:10+0000

Your file strings seem to be encoded in Windows-1250. Your database seems to contain UTF-8 strings.

So you can convert all strings to unicode first:

 codes_from_file = [a.decode("windows-1250") for a in codes_from_file]
 kode_prfoksov = [a.decode("utf-8") for a in kode_prfoksov]

Or, if you don't want unicode strings, just convert the file strings to UTF-8:

 codes_from_file = [a.decode("windows-1250").encode("utf-8") for a in codes_from_file]
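Putting the pieces together, a minimal sketch of the whole comparison might look like the following. The sample byte strings are made up for illustration; the list names follow the question, while `file_codes`, `db_codes` and the two sets are names introduced here. It assumes the file really is Windows-1250 and the database really returns UTF-8 byte strings.

```python
# Made-up sample data standing in for the real file and database contents.
codes_from_file = [b'jo\x9a', b'abc']        # assumed Windows-1250 bytes
kode_prfoksov = [b'jo\xc5\xa1', b'xyz']      # assumed UTF-8 bytes

# Decode each list with its own encoding so both hold unicode text.
file_codes = [a.decode('windows-1250') for a in codes_from_file]
db_codes = [a.decode('utf-8') for a in kode_prfoksov]

# Sets make the membership tests cheap for long lists.
file_set = set(file_codes)
db_set = set(db_codes)

common = [a for a in file_codes if a in db_set]
only1 = [a for a in file_codes if a not in db_set]   # only in the file
only2 = [a for a in db_codes if a not in file_set]   # only in the database
```

With this sample data, the two spellings of "još" compare equal after decoding, so the value ends up in `common` rather than in both `only1` and `only2`.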
