I am setting up a server to do a lot of automatic OCR using tesseract, and I want to do some post-processing of the results.
There are plenty of resources on the theory, but I have found very little on the practical side.
I assume that there are some basic things you can do, for example:
- Eliminate three identical letters in a row.
- Remove words that consist entirely of vowels.
- Eliminate words longer than a certain length.
- Etc.
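As a rough illustration, the three heuristics above could be sketched in Python like this. The names `looks_like_ocr_noise`, `filter_ocr_text`, and the `MAX_WORD_LEN` threshold are hypothetical and not tuned on real data; this only shows the shape of the filter, not a finished solution.

```python
import re

# Hypothetical cutoff; a real value would need tuning on actual documents.
MAX_WORD_LEN = 25

# The same letter three times in a row, e.g. "courrrt".
TRIPLE_LETTER = re.compile(r"([a-z])\1\1", re.IGNORECASE)
# A token made up only of vowels, e.g. "aeiou".
ALL_VOWELS = re.compile(r"[aeiou]+", re.IGNORECASE)


def looks_like_ocr_noise(word: str) -> bool:
    """Return True if the word trips one of the three heuristics."""
    if TRIPLE_LETTER.search(word):
        return True
    # Single-letter words such as "a" or "I" are legitimate, so skip them.
    if len(word) > 1 and ALL_VOWELS.fullmatch(word):
        return True
    if len(word) > MAX_WORD_LEN:
        return True
    return False


def filter_ocr_text(text: str) -> str:
    """Drop whitespace-separated tokens that look like OCR noise."""
    kept = []
    for token in text.split():
        # Judge the token without surrounding punctuation, but keep it as-is.
        core = token.strip(".,;:!?\"'()")
        if not looks_like_ocr_noise(core):
            kept.append(token)
    return " ".join(kept)


if __name__ == "__main__":
    print(filter_ocr_text("The courrrt heard aeiou the plaintiff."))
```

Note that rules this blunt will also discard some legitimate tokens in court documents, for example the Roman numeral "III" or a long hyphenated term, so an allow-list would probably be needed alongside them.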
I haven't given it a ton of thought, but the text of the OCR'd file goes into a search engine, so keeping the word map clean is a good thing, as is eliminating or fixing words that are clearly erroneous.
If it matters, the content itself is court documents written in English, so proper names appear from time to time, but the variety of words is probably small, and the fonts are probably quite consistent.
Any pointers or good resources I should be aware of?