I have a large CSV file that contains Unicode characters which break the Python script I'm trying to run. My process for removing them so far has been rather tedious. I run my script, and as soon as it hits a Unicode character, I get an error message:
'ascii' codec can't encode character u'\xef' in position 197: ordinal not in range(128)
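One small improvement I've found for the lookup step: Python's built-in unicodedata module can at least tell me the character's official name, so I don't have to guess:

    import unicodedata

    # Name the character from the error message above.
    print(unicodedata.name(u'\xef'))  # LATIN SMALL LETTER I WITH DIAERESIS

That tells me what the character is, but I still have to pick a plain-text replacement for it by hand.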
Then I google u'\xef' and try to figure out what the character really is (beyond the unicodedata trick above, does anyone know a website with a list of these definitions?). I use that information to build up a dictionary, and I have a second Python script that converts the Unicode characters to plain text:
    import csv
    import glob
    import os

    # Every problem byte I have identified so far, mapped to plain text.
    unicode_dict = {"\xb0": "deg", "\xa0": " ", "\xbd": "1/2",
                    "\xbc": "1/4", "\xb2": "^2", "\xbe": "3/4"}

    for f in glob.glob(r"C:\Folder1\*.csv"):
        in_csv = f
        out_csv = f.replace(".csv", "_2.csv")
        write_f = open(out_csv, "wb")
        writer = csv.writer(write_f)
        with open(in_csv, "rb") as csvfile:
            reader = csv.reader(csvfile)
            for row in reader:
                new_row = []
                for s in row:
                    # Apply every known substitution to this field.
                    for k, v in unicode_dict.iteritems():
                        s = s.replace(k, v)
                    new_row.append(s)
                writer.writerow(new_row)
        write_f.close()
        # Overwrite the original file with the cleaned copy.
        os.remove(in_csv)
        os.rename(out_csv, in_csv)
Then I need to run the code again, hit the next error, and look up the next Unicode character on Google. There must be a better way, right?
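For what it's worth, the closest I've gotten to something general is a sketch along these lines: decode each raw field with what I assume is the file's real encoding (cp1252 below is just a guess, since the files live on a Windows drive), NFKD-normalize so accented letters decompose into a base letter plus a combining mark, and drop whatever still won't encode as ASCII:

    import unicodedata

    def to_ascii(raw, encoding="cp1252"):
        # cp1252 is only a guess at the CSV's actual encoding.
        text = raw.decode(encoding)
        # NFKD splits e.g. u'\xef' into 'i' plus a combining diaeresis,
        # so the base letter survives the ASCII step below.
        text = unicodedata.normalize("NFKD", text)
        # Anything with no ASCII equivalent (like the degree sign) is dropped.
        return text.encode("ascii", "ignore")

    print(to_ascii("na\xefve fa\xe7ade"))  # naive facade

The catch is that this silently discards characters like "\xb0" instead of mapping them to "deg" the way my dictionary does. I've also seen the third-party unidecode package suggested for this kind of transliteration, but I haven't tried it.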