UnicodeDecodeError with Tesseract OCR in Python

Question

I am trying to extract text from an image file using Tesseract OCR in Python, but I ran into an error and cannot figure out how to deal with it. My environment seems to be fine, since the sample OCR example ran without problems in Python.

Here is the code:

    from PIL import Image
    import pytesseract

    strs = pytesseract.image_to_string(Image.open('binarized_image.png'))
    print(strs)

This is the error I get in the Eclipse console:

        strs = pytesseract.image_to_string(Image.open('binarized_body.png'))
      File "C:\Python35x64\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
        return f.read().strip()
      File "C:\Python35x64\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 20: character maps to <undefined>

I am using Python 3.5 x64 on Windows 10.

+6

python tesseract python-tesseract

Nwawel a iroume Dec 15 '15 at 15:37

2 answers

I had the same problem, but I also needed to save the pytesseract output to a file. I wrapped the OCR call in a function and passed encoding='utf-8' when opening the output file, so my function now looks like this:

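    import pytesseract

    def image_ocr(image_path, output_txt_file_name):
        # OCR the image using the English and Czech language data
        image_text = pytesseract.image_to_string(image_path, lang='eng+ces', config='--psm 1')
        # Write as UTF-8 explicitly so the Windows default (cp1252) is not used
        with open(output_txt_file_name, 'w+', encoding='utf-8') as f:
            f.write(image_text)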

I hope this helps someone :)

+2

Novak Oct 2 '18 at 7:47

randomusername · Accepted Answer · 2015-12-15T15:48:34+0000

The problem is that Python is using the Windows default encoding (CP1252) instead of what it should use (UTF-8). Tesseract writes its output as UTF-8, and pytesseract then reads that output back with the platform default encoding. The text contains a non-ASCII character whose UTF-8 bytes include 0x9d, which has no mapping in CP1252, so the decode fails. On a platform where the default encoding is UTF-8, you will not encounter this error.
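To make the failure concrete, here is a minimal sketch that reproduces the same kind of error without Tesseract. It assumes the offending character was a right double quotation mark (U+201D), which is only a guess: the traceback tells us the byte was 0x9d, not which character it belonged to.

```python
# Assumption: the OCR text contained U+201D, whose UTF-8 encoding ends in 0x9d.
utf8_bytes = "\u201d".encode("utf-8")  # the three bytes e2 80 9d

try:
    # This mirrors what happens when UTF-8 output is read with the cp1252 codec.
    utf8_bytes.decode("cp1252")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    # exc.start is the index of the first byte the codec could not decode.
    failing_byte = utf8_bytes[exc.start]

# Decoding the same bytes as UTF-8 gives back the original character.
as_utf8 = utf8_bytes.decode("utf-8")
```

The same bytes decode cleanly once the codec matches the encoding that produced them, which is why every fix here comes down to reading the output as UTF-8 or as raw bytes.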

You can try using a different function (perhaps one that returns bytes instead of str, so you don't have to worry about the encoding). You can change Python's default encoding as suggested in one of the comments, although this will cause problems when you try to print a string to the Windows console. Or, and this is my recommended solution, you can install Cygwin and run Python from there to get clean UTF-8 output.
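As a rough sketch of the first option, you could skip image_to_string and call the tesseract command-line tool yourself, then read its output file explicitly as UTF-8. The helper names read_utf8 and ocr_via_cli are made up for illustration, and the sketch assumes the tesseract executable is on your PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def read_utf8(path):
    """Read a text file as UTF-8, regardless of the platform default encoding."""
    return Path(path).read_text(encoding="utf-8").strip()


def ocr_via_cli(image_path, tesseract_cmd="tesseract"):
    """Hypothetical helper: run the tesseract CLI and decode its output as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        out_base = Path(tmp) / "ocr_output"
        # tesseract appends ".txt" to the output base name it is given.
        subprocess.run([tesseract_cmd, str(image_path), str(out_base)], check=True)
        return read_utf8(out_base.with_suffix(".txt"))
```

Calling ocr_via_cli('binarized_image.png') would then return a str. Printing it to the Windows console can still fail if the console encoding cannot represent every character, so writing it to a file opened with encoding='utf-8' is the safer route.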

If you want a quick and dirty solution that won't break anything (yet), here is something you might try:

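    import builtins

    from PIL import Image
    import pytesseract

    original_open = open

    def bin_open(filename, mode='rb'):  # note, the default mode now opens in binary
        return original_open(filename, mode)

    img = Image.open('binarized_image.png')

    try:
        # Temporarily patch open() so pytesseract reads its output file as bytes
        builtins.open = bin_open
        bts = pytesseract.image_to_string(img)
    finally:
        # Always restore the real open(), even if OCR raises
        builtins.open = original_open

    # Decode as CP1252, skipping bytes that have no mapping in that codec
    print(str(bts, 'cp1252', 'ignore'))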
