Pandas Unicode import/export error with to_excel() and read_excel()

Morning.

I have reduced a much larger problem to the following:

I have one file with a DataFrame that contains some values.

    df = pd.DataFrame({'joe': [['dog'], ['cat'], ['fish'], ['rabbit']],
                       'ben': [['dog'], ['fish'], ['fish'], ['bear']]})

    df:
          ben       joe
    0   [dog]     [dog]
    1  [fish]     [cat]
    2  [fish]    [fish]
    3  [bear]  [rabbit]

The type of the data contained in this DataFrame is as follows:

    type(df.iloc[2,1]),df.iloc[2,1]
    >>> (list, ['fish'])

When I save the DataFrame to Excel using to_excel():

    writer1 = pd.ExcelWriter('Input Output Test.xlsx')
    df.to_excel(writer1,'Sheet1')
    writer1.save()

I immediately read it back in, in the same file, as follows:

 dfi = pd.read_excel(open('Input Output Test.xlsx'), sheetname='Sheet1') 

I check the data type again:

    type(dfi.iloc[2,1]),dfi.iloc[2,1]
    >>> (unicode, u"['fish']")

The data is now a unicode string. This is problematic because when I compare the two DataFrames as follows, every result is False because the types no longer match:

    np.where(df['joe'] == dfi['joe'],True,False)

    dfi:
            ben         joe   test
    0   ['dog']     ['dog']  False
    1  ['fish']     ['cat']  False
    2  ['fish']    ['fish']  False
    3  ['bear']  ['rabbit']  False
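The mismatch can be reproduced without Excel. The following is a minimal sketch in plain Python 3 (where the Python 2 `unicode` type is simply `str`), assuming the value read back is the text form of the original list, as the `u"['fish']"` output above suggests:

```python
# Original cell value, as stored in df.
original = ['fish']

# Value as it comes back from the spreadsheet in the question: text, not a list.
round_tripped = "['fish']"

# A list and a string are different types, so they never compare equal,
# even though they print almost identically.
print(type(original).__name__, type(round_tripped).__name__)
print(original == round_tripped)

# The text happens to match the repr of the list.
print(repr(original) == round_tripped)
```

So the element-wise comparison is between lists on one side and text on the other, which is consistent with every row coming out False.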

What happens during the write and read process that causes this change, and how can I keep the str format after saving?

Edit: Unfortunately, the nature of my problem requires that the DataFrame be saved and then handled in another file.

Edit in response to EdChum's comment: if I instead store these values as strings rather than lists, I still get the same type error:

    df = pd.DataFrame({'joe': ['dog', 'cat', 'fish', 'rabbit'],
                       'ben': ['dog', 'fish', 'fish', 'bear']})

        ben     joe
    0   dog     dog
    1  fish     cat
    2  fish    fish
    3  bear  rabbit

    writer1 = pd.ExcelWriter('Input Output Test Joe.xlsx')
    df.to_excel(writer1,'Sheet1')
    writer1.save()

    dfi = pd.read_excel(open('Input Output Test Joe.xlsx','rb'), sheetname='Sheet1')

    type(dfi.iloc[2, 1]), dfi.iloc[2, 1]
    (unicode, u'fish')

Again, the comparison fails.

Edit: Evaluating the unicode value back into a regular Python object can also be done with ast.literal_eval(), as described here: "Converting a string representation of a list to a list in Python", or as per EdChum's suggestion.
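As a rough illustration of the ast.literal_eval() approach, here is a minimal Python 3 sketch. The helper name `restore_lists` and the sample values are hypothetical, chosen to mirror the 'joe' column from the question; no Excel file is involved.

```python
import ast

def restore_lists(values):
    """Hypothetical helper: parse each text value back into a Python literal."""
    return [ast.literal_eval(v) for v in values]

# Text values shaped like the ones read_excel() returned in the question.
read_back = ["['dog']", "['cat']", "['fish']", "['rabbit']"]
original = [['dog'], ['cat'], ['fish'], ['rabbit']]

restored = restore_lists(read_back)

# After parsing, the values are lists again and compare equal to the originals.
print(restored == original)
```

On a DataFrame column, the same idea could presumably be applied with `dfi['joe'].apply(ast.literal_eval)`. This only makes sense for the list case; a plain value such as `fish` is not a Python literal and would make ast.literal_eval() raise.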

Note: if you use to_csv() and read_csv(), this problem does not occur.

But why does to_excel() / read_excel() change the original data?

1 answer

But why does to_excel() / read_excel() change the original data?

I don't know. I looked briefly at the source of to_excel / read_excel, but did not find any hints.
Setting engine='xlsxwriter' and leaving the encoding at its default seems to do it:

    import pandas as pd

    df = pd.DataFrame({'joe': [['dog'], ['cat'], ['fish'], ['rabbit']],
                       'ben': [['dog'], ['fish'], ['fish'], ['bear']]})

    with pd.ExcelWriter('Input Output Test.xlsx') as writer:
        df.to_excel(writer, sheet_name='Sheet1', engine='xlsxwriter')

    dfi = pd.read_excel('Input Output Test.xlsx')

    assert eval(dfi.iloc[2,1]) == df.iloc[2,1]  # True
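Since eval() executes whatever expression the cell text contains, a more cautious variant of the final check might use ast.literal_eval(), which accepts only Python literals. A minimal sketch, assuming Python 3; the helper name `parse_cell` is hypothetical:

```python
import ast

def parse_cell(text):
    """Hypothetical helper: return the parsed literal, or None if the text
    is not a plain Python literal (list, str, number, and so on)."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None

# A cell holding the text form of a list is parsed back into a list.
print(parse_cell("['fish']"))

# A cell holding an arbitrary expression is rejected instead of executed.
print(parse_cell("__import__('os').getcwd()"))
```

Returning None for rejected input is just one possible choice for a sketch; raising the error may be preferable when a malformed cell should stop processing.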

Source: https://habr.com/ru/post/1263780/

