Teaching OCR to Understand NSA and FISC Redactions

Question

I maintain an archive of heavily redacted documents released by the Foreign Intelligence Surveillance Court.

They contain large sections of text that look like this:

screenshot of redacted text

When OCR is run on this, you get text like:

production of this data daily for 90 days. Sole purpose of this
production is to obtain foreign intelligence information in support of
individual authorized investigations to protect against international terrorism and

So in the OCRed version, wherever there is a blacked-out region, the words are simply missing. Sometimes the missing words leave a grammatically correct sentence with a different or strange meaning (as above). Other times the resulting sentences make no sense, but either way it is a problem. It would be much better if the OCR engine could return X's for these regions, or Unicode squares.

The result I need is something like:

production of this data daily for 90 days. Sole purpose of this
production is to obtain foreign intelligence information in support of XXXXXXXXXXX
individual authorized investigations to protect against international terrorism and

My question is how to get these X's. Is there a way to analyze the images to identify the blacked-out regions? Is there a way to replace them with X's or some better Unicode character? I am open to any ideas for doing this right, but image manipulation is not my strong suit, and neither is hacking deep inside the OCR engine.


unicode imagemagick ocr tesseract leptonica

mlissner Sep 17 '13 at 22:29


1 answer

nguyenq · Answer 1 · 2013-09-27T13:13:01+0000

You can train Tesseract to recognize these long blobs. Depending on the length of the blob, you have to assign a different number of characters to the "X". Read TrainingTesseract3 for the training process.
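To make the "number of characters depends on the blob length" idea concrete, here is a minimal sketch in plain Python. It is not the Tesseract training described above and it does not call Tesseract or Leptonica; it only illustrates finding long dark runs and turning their width into a run of X's. It assumes the image has already been cropped to a single text line and loaded as rows of grayscale values (0 = black, 255 = white). The function names, the thresholds, and the `char_width` estimate are all hypothetical and would need tuning for real scans.

```python
from typing import List, Tuple

# Assumption: one cropped text line, as rows of grayscale pixels (0 = black, 255 = white).
Image = List[List[int]]


def find_redaction_runs(image: Image,
                        dark_threshold: int = 50,
                        min_width: int = 20,
                        min_dark_fraction: float = 0.9) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges (end exclusive) that are almost entirely dark."""
    height = len(image)
    width = len(image[0]) if height else 0

    # A column counts as redacted when nearly all of its pixels are dark.
    # Ordinary glyphs leave plenty of white in each column, so they fail this check.
    dark_cols = []
    for x in range(width):
        dark = sum(1 for y in range(height) if image[y][x] <= dark_threshold)
        dark_cols.append(dark / height >= min_dark_fraction)

    runs = []
    start = None
    # The trailing False is a sentinel that closes a run touching the right edge.
    for x, is_dark in enumerate(dark_cols + [False]):
        if is_dark and start is None:
            start = x
        elif not is_dark and start is not None:
            if x - start >= min_width:  # ignore narrow strokes such as "l" or "|"
                runs.append((start, x))
            start = None
    return runs


def placeholder_for(run: Tuple[int, int], char_width: float, fill: str = "X") -> str:
    """Map a run's pixel width to a string of fill characters (at least one)."""
    start, end = run
    return fill * max(1, round((end - start) / char_width))
```

With an estimated character width for the line (hypothetically taken from the bounding boxes of the words the OCR engine did recognize), each run from `find_redaction_runs` can be passed to `placeholder_for` and the result spliced into the OCR text at the matching horizontal position.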
