I am sure this question has been answered before, but I have no idea what to search for. The problem shows up not so much for me as for others. In short, I have a Python script that decodes text; it decodes perfectly for me, but it fails for other users, even with the same code and input.
I wrote a script (source on Bitbucket) that converts SMS messages from Windows Mobile 6 (exported via PIM Backup) to Android (imported via SMS Backup and Restore) by converting the contents of the PIM Backup file to an XML format compatible with SMS Backup and Restore.
Now, PIM Backup writes its content in UCS-2 Little Endian format, which is nice because it supports all kinds of international text. In my script, I load the content using Python's built-in string decoding and create a csv reader object with:
# Read the file contents
sms_text = csv_file.read().decode('utf-16').split(os.linesep)
sms_reader = csv.reader(sms_text, delimiter=';', quotechar='"', escapechar='\\')
Then I process each line of the csv reader with
row = sms_reader.next()
I wrap this in a try block because it occasionally throws a UnicodeEncodeError when something is not quite right, although this is very rare for me.
My problem is that this seems to happen all the time for other users of my script who have non-ASCII characters in their SMS messages. A German user recently contacted me saying that only about 10% of his SMS messages were decoded correctly. He sent me his .pib file, I ran it through my script, and had no problems with the conversion. The entire output appeared to be standard ANSI / ISO 8859-1 / Windows-1252 / whatever, so it is hardly exotic.
My question is: why can these users not decode their input when I have no problems using exactly the same code (and the same Python version)? And as a follow-up, what can I change in my script so that it works for everyone?
EDIT: One important point I did not mention is that I run the script in Eclipse using PyDev. When I run it on the command line, I get the same problems as everyone else! I still don't know what the cause is, but hopefully this helps narrow it down.
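Since the behaviour differs between PyDev and the command line, one way to narrow this down is to compare the encodings each environment reports. The sketch below assumes the difference lies in the interpreter's encoding settings, which is not confirmed; `report_encodings` is a hypothetical helper name, not part of the script.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings the current interpreter session reports."""
    return {
        # Codec Python 2 uses for implicit unicode <-> str conversions.
        'default': sys.getdefaultencoding(),
        # Encoding of the console; may be None when output is redirected.
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Encoding suggested by the user's locale settings.
        'preferred': locale.getpreferredencoding(),
    }


if __name__ == '__main__':
    for name, value in sorted(report_encodings().items()):
        print('%-10s %s' % (name, value))
```

Running this once from Eclipse and once from the command line, and comparing the two outputs, should show whether the environments really differ in this respect.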
A very simple example of a .csm file (extracted from a .pib file, with names and numbers anonymized) containing non-standard characters looks like this:
Msg Id;Sender Name;Sender Address;Sender AddressType;Prefix;Subject;Body;BodyType;Folder;Account;Msg Class;Content Length;Msg Size;Msg Flags;Msg Status;Modify Time;Delivery Time;Recipient Nbr;Recipients;Attachment Nbr;Attachments
0x00,0x00;"491703000000";"491703000000";;"";"Wir wünschen dem rainer alles gute und viel gesundheit! Bis nächste woche, wir hören uns bis dahin noch mal.. Liebe grüße aus md!";"";0;"\\%MDF3";"SMS";"IPM.SMStext";;;33;262144;2007,09,23,19,44,32;2007,09,23,19,44,31;1;"851980\;Gela\;+491739000000\;1\;0\;SMS";0;""
It is hard to tell whether the problem is limited to this line, since I do not get any exceptions myself.
Another example, one that does cause problems for me (even in Eclipse), is the following:
Msg Id;Sender Name;Sender Address;Sender AddressType;Prefix;Subject;Body;BodyType;Folder;Account;Msg Class;Content Length;Msg Size;Msg Flags;Msg Status;Modify Time;Delivery Time;Recipient Nbr;Recipients;Attachment Nbr;Attachments
0x00,0x00;"Jonas/M";"\"Jonas/M\" <+46737000000>";;"";"Den går 28 ";"";2;"\\%MDF4";"SMS";"IPM.SMStext";0;24;0;0;2011,03,12,21,15,19;2011,03,12,21,16,17;0;"";0;""
0x00,0x00;"Don Vär";"\"Don Vär\" <+46709000000>";;"";"försöke® dhdjhdhhdjehdejehţýùhbfvfghjujhuikjkłánjajnxsjajmsxnsmajmkjsnshdjnsjmwkjhdnjsjmwkjdhjjdewjjwjwjw®";"";2;"\\%MDF1";"SMS";"IPM.SMStext";0;212;1;0;2010,05,17,15,56,49;2010,05,17,15,55,46;0;"";0;""
Exception traceback:
Traceback (most recent call last):
  File "C:\Programming\workspace\pim2smsbr\src\pim2smsbr.py", line 207, in <module>
    convert(args.source[0], args.out)
  File "C:\Programming\workspace\pim2smsbr\src\pim2smsbr.py", line 98, in convert
    row = sms_reader.next()
  File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ue403' in position 77: character maps to <undefined>
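The traceback shows the `cp1252` codec failing on `u'\ue403'`, a code point with no Windows-1252 equivalent. Here is a minimal sketch of that failure in isolation, assuming the unicode text is being encoded implicitly somewhere inside the csv reader (`can_encode` is a hypothetical helper, not part of the script):

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# German umlauts exist in Windows-1252 but not in ASCII.
print(can_encode(u'gr\xfc\xdfe', 'cp1252'))
print(can_encode(u'gr\xfc\xdfe', 'ascii'))
# The character from the traceback has no Windows-1252 mapping.
print(can_encode(u'\ue403', 'cp1252'))
# UTF-8 can represent all of them.
print(can_encode(u'\ue403', 'utf-8'))
```

If the codec used for the implicit conversion were `cp1252` in one environment and `ascii` in another, that would be consistent with the German text converting for me but failing for others, though I have not verified that this is the cause.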
UPDATE:
John Machin's answer below works. I changed just one line, and now everything works. The change:
sms_text = csv_file.read().decode('utf-16').split(os.linesep)
To:
sms_text = csv_file.read().decode('utf-16').encode('utf-8').splitlines()
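A slightly fuller sketch of the same idea, assuming the goal is to get unicode strings back after parsing. In Python 2 the `csv` module works on byte strings, so the text is encoded to UTF-8 before parsing and each cell is decoded afterwards. The Python 3 branch is only there so the example runs on either version. `read_pim_rows` and the sample data are hypothetical, not taken from the script.

```python
import csv
import sys


def read_pim_rows(raw_bytes):
    """Parse UTF-16 encoded PIM Backup CSV data into rows of text cells."""
    text = raw_bytes.decode('utf-16')
    options = dict(delimiter=';', quotechar='"', escapechar='\\')
    if sys.version_info[0] == 2:
        # Python 2: csv only handles byte strings, so parse UTF-8 bytes
        # and decode every cell back to unicode afterwards.
        reader = csv.reader(text.encode('utf-8').splitlines(), **options)
        return [[cell.decode('utf-8') for cell in row] for row in reader]
    # Python 3: csv works on text directly.
    return list(csv.reader(text.splitlines(), **options))


# Made-up two-line sample, encoded the way the input file is assumed to be.
sample = u'Msg Id;Body\r\n0x00,0x00;"Liebe gr\xfc\xdfe aus md!"\r\n'.encode('utf-16')
for row in read_pim_rows(sample):
    print(row)
```

Using `splitlines()` instead of `split(os.linesep)` also avoids depending on the line separator of the platform the script happens to run on.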