Python decoding works for me, but not for others

I am sure this question has been answered before, but I have no idea what to search for. The problem shows up not so much for me as for others. In short, I have a Python script that decodes text, and it decodes perfectly for me, but it doesn't work for other users, even with the same code and input.

I wrote a script (source on Bitbucket) that converts SMS messages from Windows Mobile 6 (exported via PIM Backup) to Android messages (imported via SMS Backup and Restore) by converting the contents of the PIM Backup file to an XML format compatible with SMSB&R.

Now, PIM Backup outputs its content in UCS-2 Little Endian format, which is nice because it supports all kinds of international conversations. In my script, I load the content using Python's built-in string decoding and create a csv reader object with:

 # Read the file contents
 sms_text = csv_file.read().decode('utf-16').split(os.linesep)
 sms_reader = csv.reader(sms_text, delimiter=';', quotechar='"', escapechar='\\')

Then I process each line of the csv reader with

 row = sms_reader.next() 

I have this in a try block, because very rarely it throws a UnicodeEncodeError when something is not quite right. But then again, this is very rare for me.

My problem is that this error seems to be raised all the time for other users of my script who have non-ASCII characters in their SMS messages. A German user recently contacted me saying that only about 10% of his SMS messages were decoded correctly. He sent me his .pib file, I ran it through my script and had no problems with the conversion. The entire output appeared to be standard ANSI / ISO 8859-1 / Windows-1252 / whatever, so it's hardly exotic.

My question is: why can't these users decode their input when I have no problems using exactly the same code (and the same Python version)? And as a follow-up, what can I change in my script so that it works for everyone?

EDIT: One important point that I did not mention is that I run the script in Eclipse using PyDev. When I run it from the command line, it shows all the same problems as for everyone else! I still don't know what the problem is, but hopefully this helps narrow it down.

An example of a very simple .csm file (extracted from a .pib file, names and numbers) containing non-ASCII characters looks like this:

 Msg Id;Sender Name;Sender Address;Sender AddressType;Prefix;Subject;Body;BodyType;Folder;Account;Msg Class;Content Length;Msg Size;Msg Flags;Msg Status;Modify Time;Delivery Time;Recipient Nbr;Recipients;Attachment Nbr;Attachments
 0x00,0x00;"491703000000";"491703000000";;"";"Wir wünschen dem rainer alles gute und viel gesundheit! Bis nächste woche, wir hören uns bis dahin noch mal.. Liebe grüße aus md!";"";0;"\\%MDF3";"SMS";"IPM.SMStext";;;33;262144;2007,09,23,19,44,32;2007,09,23,19,44,31;1;"851980\;Gela\;+491739000000\;1\;0\;SMS";0;""

It is not trivial to tell whether this particular line is the one causing the problem, since I do not get the exceptions myself.

Another example where I am having problems (even in Eclipse) is the following:

 Msg Id;Sender Name;Sender Address;Sender AddressType;Prefix;Subject;Body;BodyType;Folder;Account;Msg Class;Content Length;Msg Size;Msg Flags;Msg Status;Modify Time;Delivery Time;Recipient Nbr;Recipients;Attachment Nbr;Attachments
 0x00,0x00;"Jonas/M";"\"Jonas/M\" <+46737000000>";;"";"Den går 28 ";"";2;"\\%MDF4";"SMS";"IPM.SMStext";0;24;0;0;2011,03,12,21,15,19;2011,03,12,21,16,17;0;"";0;""
 0x00,0x00;"Don Vär";"\"Don Vär\" <+46709000000>";;"";"försöke® dhdjhdhhdjehdejehţýùhbfvfghjujhuikjkłánjajnxsjajmsxnsmajmkjsnshdjnsjmwkjhdnjsjmwkjdhjjdewjjwjwjw®";"";2;"\\%MDF1";"SMS";"IPM.SMStext";0;212;1;0;2010,05,17,15,56,49;2010,05,17,15,55,46;0;"";0;""

Exception traceback:

 Traceback (most recent call last):
   File "C:\Programming\workspace\pim2smsbr\src\pim2smsbr.py", line 207, in <module>
     convert(args.source[0], args.out)
   File "C:\Programming\workspace\pim2smsbr\src\pim2smsbr.py", line 98, in convert
     row = sms_reader.next()
   File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
     return codecs.charmap_encode(input,errors,encoding_table)
 UnicodeEncodeError: 'charmap' codec can't encode character u'\ue403' in position 77: character maps to <undefined>

UPDATE:

John Machin's answer below works. I changed just one line, and now all is well. The change, from:

 sms_text = csv_file.read().decode('utf-16').split(os.linesep) 

To:

 sms_text = csv_file.read().decode('utf-16').encode('utf-8').splitlines() 
1 answer

You could start by giving us a sample PIM Backup file that you can read and a German user cannot.

The fact that you sometimes get a UnicodeEncodeError (note: Encode, not Decode) is significant. Care to change your code to display the exact error message and traceback that you get, instead of suppressing them?
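
One way to see the full error rather than swallowing it is to print the traceback inside the except clause. This is only a sketch: `next_row_or_report` is a hypothetical helper, not part of the script in the question, and it assumes the reader is any iterator of rows.

```python
import sys
import traceback


def next_row_or_report(reader):
    """Return the next row, or None after printing the full traceback.

    Hypothetical helper for illustration. StopIteration still propagates,
    so the caller can tell when the input is exhausted.
    """
    try:
        return next(reader)
    except UnicodeError:
        # UnicodeError covers both UnicodeEncodeError and UnicodeDecodeError.
        traceback.print_exc(file=sys.stderr)
        return None
```

The printed traceback shows which codec was involved (for example cp1252 versus ascii), which is the detail that tends to differ between machines.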

Are you running this on Linux / OS X / Windows? If Windows, in a Command Prompt window? If so, what does the CHCP command tell you? What does it tell your German correspondent?
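
The same comparison can also be made from inside Python, by printing the encodings the interpreter has picked up in each environment. A small sketch using only the standard library; `describe_encodings` is a made-up name, and the values may differ between a console and an IDE such as Eclipse/PyDev.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for this environment."""
    return {
        # Encoding used when printing text to the console; can be None
        # in Python 2 when output is redirected.
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Codec used for implicit str/unicode conversions in Python 2.
        'default': sys.getdefaultencoding(),
        # Encoding suggested by the user's locale settings.
        'preferred': locale.getpreferredencoding(),
    }


for name, value in sorted(describe_encodings().items()):
    print('%s: %s' % (name, value))
```

Running this once from the command line and once from the IDE, on both machines, gives four reports that can be set side by side.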

Have you read what the csv docs have to say about Unicode? Here's what happens:

 >>> import csv
 >>> r = csv.reader([u"\xA0"])
 >>> r.next()
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
 >>>

You are much more likely to get this working if you follow these steps:

  1. read the raw bytes from the file
  2. decode the byte string to Unicode using UTF-16
  3. encode the Unicode string to UTF-8
  4. split the UTF-8 string into a list of lines (use str.splitlines())
  5. feed that list to the csv reader
  6. iterate over the rows, decoding each cell from UTF-8 to Unicode
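
The steps above can be sketched roughly as follows. This is an illustration, not the script's actual code: `read_rows` is a hypothetical name, the delimiter and quoting options are copied from the question, and the Python 3 branch is an assumption added so the sketch runs on either version (Python 3's csv module accepts text directly, so the UTF-8 round trip is only needed on Python 2).

```python
import csv
import sys

PY2 = sys.version_info[0] == 2


def read_rows(raw_bytes):
    """Return the rows of a UTF-16, semicolon-delimited export as lists of text."""
    # Steps 1-2: the caller supplies the raw bytes; decode them from UTF-16.
    text = raw_bytes.decode('utf-16')
    if PY2:
        # Steps 3-4: Python 2's csv module only copes with byte strings,
        # so hand it UTF-8 encoded lines.
        lines = text.encode('utf-8').splitlines()
        # Step 5: feed the list of lines to the csv reader.
        reader = csv.reader(lines, delimiter=';', quotechar='"', escapechar='\\')
        # Step 6: decode every cell back to Unicode.
        return [[cell.decode('utf-8') for cell in row] for row in reader]
    # Python 3 (assumption, not covered by the answer): csv works on text,
    # so steps 3 and 6 are unnecessary.
    reader = csv.reader(text.splitlines(), delimiter=';', quotechar='"', escapechar='\\')
    return list(reader)
```

A caller would pass the result of reading the file in binary mode, for example `read_rows(open(path, 'rb').read())`, where `path` is whatever file the script was given. Like the original code, this splits on line boundaries first, so a quoted field containing a line break would be cut in two.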

Update: I see nothing in the changes to your question that makes me change my earlier advice. You have the option of omitting step 6 above (this will work, but it would be evil), or of including step 6 and rewriting your output phase to use [c]ElementTree or lxml to do the UTF-8 encoding, escaping, etc. By the way, you are writing XML files that claim to be encoded in UTF-8. I cannot reproduce this because I do not have Eclipse, but I suspect that the XML files you write "OK" when running in Eclipse are actually encoded in cp1252. Have you tried running them through an XML validator?

Your problem with the U+E403 character is only part of a larger problem: your script will "work" only with characters that are present in whatever encoding the csv module picks when it is given unicode input. This character is in one of the PUA (Private Use Area) blocks, which are set aside for vendor-specific material (such as the Apple logo) or application-specific material. It will not be in a legacy code page such as cp1252, and it cannot be displayed properly (because it is not in any published font). Googling "emoji E403" and following the hits suggests that it may correspond to U+1F614 PENSIVE FACE, new in Unicode 6.0.
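
A quick check of the points in the paragraph above, using only the standard library. The variable names are illustrative; the character is the one reported in the traceback in the question.

```python
import unicodedata

ch = u'\ue403'

# 'Co' is the Unicode general category "Other, private use".
category = unicodedata.category(ch)

# cp1252 has no slot for this character, which is what the
# 'charmap' UnicodeEncodeError in the traceback is reporting.
try:
    ch.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent private-use characters like any other character.
utf8_bytes = ch.encode('utf-8')

print('%s %s %r' % (category, cp1252_ok, utf8_bytes))
```

So the character survives as long as the data stays in Unicode or UTF-8; it only becomes a problem when something implicitly encodes it with a code page.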


Source: https://habr.com/ru/post/1369640/

