Reading Unicode from xls in Python

I am trying to read a .xls file with Python. The file contains several non-ASCII characters (namely äöü). I tried both openpyxl and xlrd (I had high hopes for xlrd, since it seems to read everything as Unicode), but neither worked.

I found multiple answers about encoding and decoding when printing data read from an xls file, but I can't even get that far. This script fails as soon as it opens the file:

 import xlrd
 workbook = xlrd.open_workbook('export_data.xls')

Result:

 Traceback (most recent call last):
   File "C:\Users\Administrator\workspace\tufinderxlstoxml\tufinderxlstoxml2.py", line 2, in <module>
     workbook = xlrd.open_workbook('export_data.xls')
   File "C:\Python27_32\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook
     ragged_rows=ragged_rows,
   File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 119, in open_workbook_xls
     bk.get_sheets()
   File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 705, in get_sheets
     self.get_sheet(sheetno)
   File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 696, in get_sheet
     sh.read(self)
   File "C:\Python27_32\lib\site-packages\xlrd\sheet.py", line 796, in read
     strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
   File "C:\Python27_32\lib\site-packages\xlrd\biffh.py", line 269, in unpack_string
     return unicode(data[pos:pos+nchars], encoding)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 55: ordinal not in range(128)
 WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
 *** No CODEPAGE record, no encoding_override: will use 'ascii'
 *** No CODEPAGE record, no encoding_override: will use 'ascii'

I also tried:

 workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf-8") 

which results in:

 Traceback (most recent call last):
   File "C:\Users\Administrator\workspace\tufinderxlstoxml\tufinderxlstoxml2.py", line 2, in <module>
     workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf-8")
   File "C:\Python27_32\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook
     ragged_rows=ragged_rows,
   File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 119, in open_workbook_xls
     bk.get_sheets()
   File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 705, in get_sheets
     self.get_sheet(sheetno)
   File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 696, in get_sheet
     sh.read(self)
   File "C:\Python27_32\lib\site-packages\xlrd\sheet.py", line 796, in read
     strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
   File "C:\Python27_32\lib\site-packages\xlrd\biffh.py", line 269, in unpack_string
     return unicode(data[pos:pos+nchars], encoding)
 UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 55: invalid start byte
 WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero

I also tried including this at the top of the script in each version:

 # -*- coding: utf-8 -*- 

I am running this on Python 2.7 on a machine running Windows Server 2008.
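
Both tracebacks stop on the same byte, 0x92. A minimal sketch of why, using a made-up byte string (the actual cell contents are not shown here, so `sample` is purely an assumption): 0x92 is outside the ASCII range, and in UTF-8 it can only be a continuation byte, never the start of a character. In the Windows code page cp1252 it is the right single quotation mark.

```python
# Hypothetical sample: "it's" typed with a Windows "smart" apostrophe (byte 0x92).
sample = b"it\x92s"

def try_decode(data, encoding):
    """Return the decoded text, or a short description of the decode error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "UnicodeDecodeError: " + exc.reason

# Decode the same bytes with the two codecs from the tracebacks and with cp1252.
results = {enc: try_decode(sample, enc) for enc in ("ascii", "utf-8", "cp1252")}
```

The two error reasons match the ones in the tracebacks above, which suggests the file stores its strings in a single-byte Windows code page rather than in UTF-8.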

3 answers

Thank you all for your feedback!

In the end, I fixed it with the encoding_override parameter. I could not find Microsoft documentation on which code page covers German characters, so I tried them all. Eventually I got to cp1251 and it worked!

 workbook = xlrd.open_workbook(path, encoding_override="cp1251") 
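
The "try them all" approach can be sketched as a small loop. This is only an illustration: `open_with_fallback` and its `opener` argument are hypothetical names, and the only xlrd call assumed is `xlrd.open_workbook(path, encoding_override=...)` as used above.

```python
def open_with_fallback(path, encodings, opener=None):
    """Try each encoding in turn; return (workbook, encoding) for the first that opens.

    `opener` is a hypothetical hook so the helper can be exercised without a
    real .xls file; by default xlrd.open_workbook is used.
    """
    if opener is None:
        import xlrd  # third-party; imported lazily
        opener = xlrd.open_workbook
    last_error = None
    for encoding in encodings:
        try:
            return opener(path, encoding_override=encoding), encoding
        except (UnicodeDecodeError, LookupError) as exc:
            # LookupError covers codec names Python does not know.
            last_error = exc
    raise ValueError("none of %r could decode %s (last error: %s)"
                     % (list(encodings), path, last_error))

# Usage (the candidate list is an assumption, not a recommendation):
# workbook, used = open_with_fallback('export_data.xls', ['cp1252', 'cp1251'])
```

Opening without an exception does not prove the encoding is the right one, because single-byte code pages accept almost any byte. cp1251 is the Windows Cyrillic code page and cp1252 the Western European one: the bytes 0xE4 0xF6 0xFC decode to äöü under cp1252 but to дць under cp1251, so it is worth checking a cell that contains umlauts before settling on a value.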

From my reading of the OOo docs, xls uses UTF-16LE Unicode, not UTF-8 (that is, it stores exactly two bytes per character in little-endian order), so try:

 workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf_16_le") 

(see page 17 of http://www.openoffice.org/sc/excelfileformat.pdf )
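
One way to judge this suggestion is to look at what utf_16_le does to text that is actually stored one byte per character. The sketch below assumes the cell data is single-byte code page text (the "No CODEPAGE record" warning in the question hints at an older file format where that is the case); the sample string is made up and not taken from the file.

```python
# Hypothetical sample: "München" encoded as cp1252, one byte per character (7 bytes).
raw = u"M\u00fcnchen".encode("cp1252")

# An odd number of bytes cannot be valid UTF-16, so decoding fails outright.
try:
    raw.decode("utf_16_le")
    odd_length_ok = True
except UnicodeDecodeError:
    odd_length_ok = False

# An even number of bytes may decode, but each pair of bytes is merged into
# one unrelated character instead of two letters.
mis_decoded = raw[:6].decode("utf_16_le")
```

Here six bytes of German text turn into three unrelated characters, so if the override happens to open the file without an error, the cell values would still need checking.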


A bit late, but I hope you tried unicodecsv for encoding.


Source: https://habr.com/ru/post/1487060/

