Limit the characters Tesseract is looking for

Is it possible to limit the character set that Tesseract looks for (e.g. search only for the letters a-z)? This would greatly improve my results.

+56
ocr tesseract
Mar 02
5 answers

Create a configuration file (for example, "letters") in the tessdata/configs directory, usually /usr/share/tesseract/tessdata/configs or /usr/share/tesseract-ocr/tessdata/configs

And add this line to the configuration file:

 tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz 

... or maybe [a-z] works, I don't know :-)
Then call tesseract similar to this:

 tesseract input.tif output nobatch letters 

This will limit tesseract to only recognize the characters you need.
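The steps above can be sketched as a small shell snippet. This is only an illustration: the helper name write_whitelist_config is made up, and the demo writes into a temporary directory instead of the real tessdata/configs directory, which usually needs root access.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): write a config file
# that whitelists the given characters.
#   $1 = configs directory, $2 = config name, $3 = allowed characters
write_whitelist_config() {
    mkdir -p "$1" || return 1
    printf 'tessedit_char_whitelist %s\n' "$3" > "$1/$2"
}

# Demo in a temporary directory; for real use the file has to go into
# the tessdata/configs directory of your installation.
demo_dir=$(mktemp -d)
write_whitelist_config "$demo_dir" letters abcdefghijklmnopqrstuvwxyz
cat "$demo_dir/letters"

# Once the file is in the real tessdata/configs directory:
#   tesseract input.tif output nobatch letters
```

The config file is just "variable name, space, value" on one line, so other variables can be added in the same way.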

+75
Jun 06 '10 at 6:08

As an alternative to the configuration file, there is the -c flag:

 tesseract stdin stdout -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz -psm 6 
+14
Sep 08 '16 at 9:34

To use the whitelist, either in a configuration file or with the -c tessedit_char_whitelist=... command-line option, in version 4.0 you need to set the OCR Engine mode to "Original Tesseract only" (--oem 0). This is because the new LSTM neural network mode does not take whitelist settings into account. Example of a correct command line for version 4.0:

 tesseract input_file output_file --oem 0 -c tessedit_char_whitelist=abc123 

UPDATE: In newer versions (4.0), the eng.traineddata file installed by default by the Windows installer and some Linux installers is corrupted. The workaround is to replace the tessdata\eng.traineddata file with the one from an older version. This file should be about 30 MB. Otherwise, you get an error such as "Tesseract cannot load any language!".

+12
Feb 28 '18 at 13:39

Just adding this for anyone using Tesseract on Android. In your readOCR function, where you set the language etc., add the following line:

 tesseract.setVariable("tessedit_char_whitelist","ABCDEFGHIJKLMNOPQRSTUVWXYZ"); 

You can also set a blacklist (tessedit_char_blacklist) to exclude characters.

+6
Mar 21 '17 at 13:03

In Tesseract version 4.00, this is not possible. You can only customize your model or use regular expressions to remove additional characters from the prediction.
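A minimal sketch of the "remove additional characters afterwards" idea, using tr instead of a full regular expression. The helper name keep_letters is made up, and the printf line only stands in for real OCR output.

```shell
#!/bin/sh
# Hypothetical helper: drop every character that is not a lowercase
# ASCII letter, a space, or a newline.
keep_letters() {
    LC_ALL=C tr -cd 'a-z \n'
}

# Stand-in for OCR output; with Tesseract the input could come from e.g.
#   tesseract input.tif stdout | keep_letters
printf 'h3llo w0rld!\n' | keep_letters
```

Note that this only deletes unwanted characters after recognition. Unlike a whitelist, it does not make the engine pick an allowed character instead.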

0
Apr 26 '19 at 4:24


