Basic document processing (spelling corrector)

I am setting up a server to run a large volume of automated OCR with Tesseract, and I want to post-process the results.

There are plenty of resources on the theory, but I have found very little on the practical side.

I assume that there are some basic things you can do, for example:

  • Discard words containing three identical letters in a row.
  • Discard words consisting entirely of vowels.
  • Discard words longer than a certain length.
  • Etc.
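Filters like the ones listed above can be expressed as a few regex and set checks. A minimal sketch (the thresholds, the 20-character cap, and the exact rules are illustrative guesses, not anything from the original post):

```python
import re

VOWELS = set("aeiouyAEIOUY")

def looks_garbled(word, max_len=20):
    """Heuristic junk-word filters; rules and thresholds are illustrative."""
    letters = set(word)
    if len(word) > max_len:                        # implausibly long token
        return True
    if re.search(r"(.)\1\1", word):                # three identical chars in a row
        return True
    if word.isalpha() and letters <= VOWELS:       # nothing but vowels
        return True
    return False

words = ["court", "lllegible", "aeiou", "a" * 30]
print([w for w in words if not looks_garbled(w)])  # ['court']
```

Each rule is cheap, so it is easy to add, drop, or retune them after inspecting real OCR output.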

I haven't given it a ton of thought, but since the OCR'd text is fed into a search engine, removing or fixing words that are clearly wrong is a win either way.

If it matters, the content is court documents written in English, so proper names crop up from time to time, but the vocabulary is probably limited and the fonts fairly consistent.

Any pointers or good resources I should be aware of?

1 answer

Each OCR engine has its own set of common errors, which also depend on the fonts in the document, scan quality, DPI, background color, and any image preprocessing (e.g. despeckle, deskew, line removal). The only way to find out what those errors are is to run many test passes and search the results for recurring error patterns.

Using the right scanner settings and image preprocessing algorithms can significantly improve recognition accuracy. Do not underestimate this part.
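To make the despeckle step concrete: a 3x3 median filter removes isolated noise pixels while leaving solid strokes alone. This is a toy, pure-Python stand-in for what a real pipeline would do with OpenCV or Leptonica:

```python
from statistics import median

def despeckle(img):
    """3x3 median filter over a 2D grayscale grid (borders left untouched)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = median(
                img[yy][xx]
                for yy in (y - 1, y, y + 1)
                for xx in (x - 1, x, x + 1)
            )
    return out

# A white page (255) with a single black speck in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
clean = despeckle(page)
print(clean[2][2])  # 255 — the isolated speck is gone
```

In practice you would run this (or deskew, binarization, etc.) on the scanned image before handing it to Tesseract, not on the text afterwards.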

If the text is mostly English words, a good dictionary combined with fuzzy matching will help a lot. Other useful techniques are trigram analysis and voting between two OCR engines.
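The fuzzy dictionary lookup can be sketched with the standard library's `difflib`. The tiny lexicon and the 0.8 cutoff below are placeholders; a real system would load a full English word list plus domain terms (legal vocabulary, known party names):

```python
import difflib

# Placeholder lexicon — substitute a real word list for production use.
LEXICON = ["plaintiff", "defendant", "judgment", "court", "motion"]

def correct(word, cutoff=0.8):
    """Snap an OCR token to its closest dictionary entry, if close enough."""
    matches = difflib.get_close_matches(word.lower(), LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("plaintifl"))   # 'plaintiff'
print(correct("defendarit"))  # 'defendant'
```

The cutoff controls the precision/recall trade-off: too low and valid rare words (proper names especially) get "corrected" into dictionary entries, too high and obvious OCR slips survive. Tuning it against real output from your scanner is the analysis step the answer describes.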


Source: https://habr.com/ru/post/1392603/
