Pytesseract OCR: multiple config options

I'm having issues with pytesseract. I need to configure Tesseract so that it recognizes single digits and accepts digits only, because the number zero is often confused with the letter "O".

Like this:

target = pytesseract.image_to_string(im,config='-psm 7',config='outputbase digits') 

Thank you very much,

Niall

+14
2 answers

tesseract-4.0.0a supports the page segmentation modes (psm) listed below. For single-character recognition, set --psm 10. If your text consists only of digits, you can also set tessedit_char_whitelist=0123456789.

 Page segmentation modes:
   0    Orientation and script detection (OSD) only.
   1    Automatic page segmentation with OSD.
   2    Automatic page segmentation, but no OSD, or OCR.
   3    Fully automatic page segmentation, but no OSD. (Default)
   4    Assume a single column of text of variable sizes.
   5    Assume a single uniform block of vertically aligned text.
   6    Assume a single uniform block of text.
   7    Treat the image as a single text line.
   8    Treat the image as a single word.
   9    Treat the image as a single word in a circle.
  10    Treat the image as a single character.
  11    Sparse text. Find as much text as possible in no particular order.
  12    Sparse text with OSD.
  13    Raw line. Treat the image as a single text line,
        bypassing hacks that are Tesseract-specific.

Here is an example using image_to_string with several options.

 target = pytesseract.image_to_string(image, lang='eng', boxes=False,
     config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
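
Since all options have to go into one config string (the question's attempt passes config twice, which Python rejects), it may help to build that string in one place. The following is only a sketch: build_digit_config and read_digits are hypothetical helper names, not part of pytesseract, and the defaults simply mirror the example above.

```python
def build_digit_config(psm=10, oem=3, whitelist="0123456789"):
    """Return a single Tesseract config string (hypothetical helper).

    psm       -- page segmentation mode, 0..13 as listed above
    oem       -- OCR engine mode (3 = default, based on what is available)
    whitelist -- characters Tesseract is asked to restrict itself to
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


def read_digits(image, psm=10):
    """OCR `image` (a PIL image or a file path) with the digits-only config."""
    # Imported lazily so build_digit_config can be used without pytesseract.
    import pytesseract
    text = pytesseract.image_to_string(image, config=build_digit_config(psm))
    return text.strip()
```

For a whole line of digits rather than a single character, read_digits(image, psm=7) would select "Treat the image as a single text line" instead.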

Hope this helps.

+36

The reason you are having problems is that character whitelisting does not work in version 4.0. You must force legacy mode (--oem 0) to restrict the characters that are recognized. This is a bug in Tesseract that has not been fixed yet.
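
A rough sketch of how that workaround could look, combined with a fallback that cleans the result in Python in case the whitelist is still ignored. legacy_digit_config and keep_digits are hypothetical helpers, and the table of confusable letters is an assumption, not an exhaustive list. Legacy mode also assumes that the installed traineddata file contains the legacy engine model.

```python
import re

# Letters often misread in place of digits (assumed, partial list).
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def legacy_digit_config(psm=10):
    """Config string that forces the legacy engine (--oem 0) with a digit whitelist."""
    return f"--psm {psm} --oem 0 -c tessedit_char_whitelist=0123456789"


def keep_digits(raw_text):
    """Map confusable letters to digits, then drop everything that is not 0-9."""
    mapped = raw_text.translate(CONFUSABLE)
    return re.sub(r"[^0-9]", "", mapped)
```

The config string would be passed as pytesseract.image_to_string(image, config=legacy_digit_config()), and keep_digits applied to the returned text. The post-filter is only safe when the image is known to contain digits and nothing else.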

+2

Source: https://habr.com/ru/post/1268962/

