Pandas: saving to .xlsx without losing the encoding

I have a problem similar to the one mentioned here, but none of the suggested methods work for me.

I have an average-sized utf-8 .csv file with a lot of non-ASCII characters. I split the file by a specific value from one of the columns, and then I would like to save each of the resulting data frames as an .xlsx file with those characters preserved.

This does not work as I get the error message:

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 7: ordinal not in range(128) 

Here is what I tried:

  • Using the xlsxwriter engine explicitly. It doesn't seem to change anything.
  • Defining a function (below) for changing the encoding and throwing out bad characters. It also does not change anything.

     def changeencode(data):
         cols = data.columns
         for col in cols:
             if data[col].dtype == 'O':
                 data[col] = data[col].str.decode('utf-8').str.encode('ascii', 'ignore')
         return data
  • Manually changing all the offending characters to other ones. Still no effect (the quoted error appeared even after this change).

  • Encoding the file as utf-16 (which, I believe, is the correct encoding, since I want to be able to manipulate the file from Excel afterwards) also does not help.

I believe the problem is in the file itself (because of points 2 and 3), but I have no idea how to get around it. I would appreciate any help. The beginning of the file is pasted below.

 "Submitted","your-name","youremail","phone","miasto","cityCF","innemiasto","languagesCF","morelanguages","wiek","partnerCF","messageCF","acceptance-795","Submitted Login","Submitted From","2015-12-25 14:07:58 +00:00","Zózia kryś"," test@tes.pl ","4444444","Wrocław","","testujemy polskie znaki","Polski","testujemy polskie znaki","44","test","test","1","Justyna","99.111.155.132", 

EDIT

Some code (one of the versions, without the splitting part):

 import pandas as pd
 import string
 import xlsxwriter

 df = pd.read_csv('path-to-file.csv')
 with pd.ExcelWriter('test.xlsx') as writer:
     df.to_excel(writer, sheet_name='sheet1', engine='xlsxwriter')
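The splitting step is not shown in the post; a minimal sketch of what it might look like, assuming a hypothetical grouping column ('miasto', taken from the sample header above) and placeholder output names:

 import pandas as pd

 df = pd.read_csv('path-to-file.csv')
 # split the frame into one sub-frame per distinct value of the chosen column
 for value, part in df.groupby('miasto'):
     # each sub-frame is then written to its own .xlsx file
     part.to_excel('out_{}.xlsx'.format(value), index=False)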
+5
3 answers

Presumably this was a bug in the pandas version I was using at the time. Right now, with pandas 0.19.2, the code below converts the csv (with the correct encoding) without any problems.
NB: the openpyxl module must be installed on your system.

 import pandas as pd

 df = pd.read_csv('Desktop/test.csv')
 df.to_excel('Desktop/test.xlsx', encoding='utf8')
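For reference, the encoding argument of to_excel was later deprecated and removed in recent pandas releases; a sketch of the same idea without it (file paths as in the answer above) would be:

 import pandas as pd

 # newer pandas: to_excel no longer takes an encoding argument;
 # Unicode handling is left to the engine (openpyxl / xlsxwriter)
 df = pd.read_csv('Desktop/test.csv', encoding='utf-8')
 df.to_excel('Desktop/test.xlsx', index=False)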
+3

Try encoding the columns that contain non-ascii characters as

 df['col'] = df['col'].apply(lambda x: unicode(x)) 

and then save the file in xlsx format with encoding='utf8'.
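A minimal sketch of this suggestion put together (note that unicode() exists only in Python 2; the file paths and the object-dtype check are placeholders, not part of the original answer):

 # Python 2 sketch: unicode() is not available in Python 3
 import pandas as pd

 df = pd.read_csv('path-to-file.csv')
 for col in df.columns:
     if df[col].dtype == 'O':  # object columns typically hold strings
         df[col] = df[col].apply(lambda x: unicode(x))
 df.to_excel('test.xlsx', encoding='utf8', index=False)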

+2

What if you save the csv files from pandas and then use win32com to convert them to Excel? It would look something like this...

 import win32com.client

 excel = win32com.client.Dispatch("Excel.Application")
 excel.Visible = 0
 for x in range(10):
     f = path + str(x)  # not showing the pandas dataframe creation
     df.to_csv(f + '.csv')
     wb = excel.Workbooks.Open(f + '.csv')
     wb.SaveAs(f + '.xlsx', 51)  # xlOpenXMLWorkbook=51
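As an assumption about typical win32com usage rather than something in the original answer, you would probably also want to close each workbook and quit Excel when done, roughly:

 # per workbook, at the end of each loop iteration:
 wb.Close(SaveChanges=False)
 # once, after the loop:
 excel.Quit()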
0
