The problem is that Python is using the console encoding (CP1252) instead of the one it should use (UTF-8). PyTesseract has encountered a Unicode character and is trying to convert it to CP1252, which it cannot do. On other platforms you will typically not see this error, because Python uses UTF-8 there.
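To see the mismatch on your own machine, a minimal sketch like the one below may help. It prints the encodings Python picks up by default and shows that a character outside CP1252 cannot be converted to it. The helper `can_encode` and the sample character U+2713 are only illustrative choices, not something taken from PyTesseract.

```python
import locale
import sys

# Default encoding for text-mode files (often cp1252 on Windows)
print(locale.getpreferredencoding(False))
# Encoding used when printing to the console (may be None if redirected)
print(getattr(sys.stdout, "encoding", None))


def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "\u2713"  # CHECK MARK, which has no CP1252 representation

utf8_ok = can_encode(sample, "utf-8")     # UTF-8 covers all of Unicode
cp1252_ok = can_encode(sample, "cp1252")  # CP1252 covers only about 250 characters
print(utf8_ok, cp1252_ok)
```

If the first line printed is `cp1252`, any text that goes through Python's default encoding is limited to that character set.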
You can try using a different function (perhaps one that returns bytes instead of str, so you don't have to worry about encoding). You can change Python's default encoding, as suggested in one of the comments, although this will cause problems when you print a string to the Windows console. Or, and this is my recommended solution, you can install Cygwin and run Python there to get clean UTF-8 output.
If you want a quick and dirty solution that won't break anything (yet), here is what you might try:
    import builtins

    original_open = open

    def bin_open(filename, mode='rb'):  # note, the default mode now opens in binary
        return original_open(filename, mode)

    from PIL import Image
    import pytesseract

    img = Image.open('binarized_image.png')

    try:
        builtins.open = bin_open
        bts = pytesseract.image_to_string(img)
    finally:
        builtins.open = original_open

    print(str(bts, 'cp1252', 'ignore'))
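The same temporary patch can also be wrapped in a context manager, so that the real `open` is restored in one place. This is only a sketch of the idea: the name `binary_open_by_default` is made up for illustration, and it assumes the code being patched calls `open(filename)` without an explicit text mode or extra keyword arguments.

```python
import builtins
import contextlib


@contextlib.contextmanager
def binary_open_by_default():
    """Temporarily make open() default to binary mode ('rb')."""
    original_open = builtins.open

    def bin_open(filename, mode="rb"):
        # Delegate to the real open(), but with a binary default mode
        return original_open(filename, mode)

    builtins.open = bin_open
    try:
        yield
    finally:
        # Always restore the real open(), even if the body raises
        builtins.open = original_open


# Intended use (requires pytesseract and an image, so it is left as a comment):
#
#     with binary_open_by_default():
#         bts = pytesseract.image_to_string(img)
```

As with the snippet above, this changes `open` for everything that runs inside the `with` block, so keep the block as small as possible.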