Limit the characters Tesseract is looking for

Question

Is it possible to limit the character set that Tesseract looks for (e.g. search only for the letters a-z)? This would greatly improve my results.

+56

ocr tesseract

Danilo Bargen Mar 02

5 answers

In addition to a configuration file, you can use the -c flag:

 tesseract stdin stdout -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz -psm 6

+14

jmunsch Sep 08 '16 at 9:34

To use a whitelist, whether in a configuration file or via the -c tessedit_char_whitelist=... command-line option, in the newest version 4.0 you need to set the OCR Engine mode to "Original Tesseract only" (--oem 0). This is because the new LSTM neural network mode does not take whitelist settings into account. Example of a correct command line for version 4.0:

tesseract input_file output_file --oem 0 -c tessedit_char_whitelist=abc123
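If you call tesseract from a script, the same command line can be assembled as an argument list. The following is only a sketch: the helper name build_tesseract_command and its parameters are made up for illustration, and it merely builds the arguments without running anything. It assumes the 4.x option spelling (--oem, --psm); the older 3.x releases used -psm as in the answer above.

```python
import shlex

def build_tesseract_command(input_path, output_base, whitelist, oem=None, psm=None):
    """Return the argument list for a tesseract call restricted to `whitelist`.

    Hypothetical helper: it only assembles arguments, it does not run tesseract.
    """
    cmd = ["tesseract", input_path, output_base]
    if oem is not None:
        # 0 selects the original (legacy) engine, as described in the answer.
        cmd += ["--oem", str(oem)]
    if psm is not None:
        # 4.x spelling; 3.x used "-psm".
        cmd += ["--psm", str(psm)]
    # No spaces around "=": the whole assignment has to be a single argument.
    cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd

cmd = build_tesseract_command("input_file", "output_file", "abc123", oem=0)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where tesseract is installed.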

UPDATE: In newer versions (4.0), the eng.traineddata file installed by default by the Windows installer and some Linux installers is corrupted. The workaround is to replace the tessdata\eng.traineddata file with the one from an older version. This file should be about 30 MB. Otherwise, you get an error: "Tesseract cannot load any language!" or similar.

+12

Bartłomiej Uliasz Feb 28 '18 at 13:39

Just adding this for anyone using Tesseract on Android. In your readOCR function, where you set the language etc., add the following line:

 tesseract.setVariable("tessedit_char_whitelist","ABCDEFGHIJKLMNOPQRSTUVWXYZ");

You can also set a blacklist (tessedit_char_blacklist) to exclude characters.

+6

user3244591 Mar 21 '17 at 13:03

In Tesseract version 4.00, this is not possible. You can only customize your model or use regular expressions to remove the unwanted characters from the prediction.
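The regular-expression approach could look roughly like this. It is a minimal sketch, not part of Tesseract: strip_unwanted is a made-up helper name, and the sample string stands in for whatever text the OCR run returned. Whitespace is kept so that words and lines stay separated.

```python
import re

def strip_unwanted(text, allowed_chars):
    """Remove every character that is neither in allowed_chars nor whitespace."""
    # re.escape guards characters such as "]" or "-" inside the character class.
    pattern = re.compile(f"[^{re.escape(allowed_chars)}\\s]")
    return pattern.sub("", text)

raw = "he1lo w0rld!\n"  # stand-in for OCR output
cleaned = strip_unwanted(raw, "abcdefghijklmnopqrstuvwxyz")
print(cleaned)
```

Note that this only deletes characters after the fact. Unlike a real whitelist, it cannot make the engine choose "l" over "1" in the first place.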

0

adel rahimi Apr 26 '19 at 4:24

Blomman · Accepted Answer · 2010-06-06 06:08

Create a configuration file (for example, "letters") in the tessdata/configs directory - usually /usr/share/tesseract/tessdata/configs
or
/usr/share/tesseract-ocr/tessdata/configs

And add this line to the configuration file:

 tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz

... or maybe [a-z] works .. dunno :-)
Then call tesseract similar to this:

 tesseract input.tif output nobatch letters

This will limit tesseract to only recognize the characters you need.
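The config file can also be generated from a script. The sketch below is only an illustration: write_whitelist_config is a made-up helper, and it writes into a temporary directory rather than the real tessdata/configs directory, which usually needs root permissions. Since it is unclear whether a range like [a-z] is accepted, the letters are written out in full.

```python
import string
import tempfile
from pathlib import Path

def write_whitelist_config(configs_dir, name, whitelist):
    """Write a config file with a single tessedit_char_whitelist line."""
    path = Path(configs_dir) / name
    # Config file format: variable name, a space, then the value.
    path.write_text(f"tessedit_char_whitelist {whitelist}\n", encoding="utf-8")
    return path

# A temporary directory stands in for tessdata/configs here.
with tempfile.TemporaryDirectory() as tmp:
    path = write_whitelist_config(tmp, "letters", string.ascii_lowercase)
    content = path.read_text(encoding="utf-8")

print(content)
```

To use it for real, point configs_dir at one of the directories mentioned above and then call tesseract with "letters" as the last argument.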
