I maintain an archive of heavily redacted documents coming out of the Foreign Intelligence Surveillance Court.
They come with large sections of text blacked out.
When OCR runs on such a page, you get text like:
production of this data daily for 90 days. Sole purpose of this
production is to obtain foreign intelligence information in support of
individual authorized investigations to protect against international terrorism and
So in the OCRed version, the dark spots simply become missing words. Sometimes the missing words leave a grammatically correct sentence with a different or strange meaning (as in the example above). In other cases the resulting sentences make no sense. Either way, this is a problem. It would be much better if the OCR engine could return X's for these spots, or Unicode squares.
The result I need is something like:
production of this data daily for 90 days. Sole purpose of this
production is to obtain foreign intelligence information in support of XXXXXXXXXXX
individual authorized investigations to protect against international terrorism and
My question is how to get these X's. Is there a way to analyze the images to identify the black spots? Is there a way to replace them with X's or some better Unicode character? I am open to any ideas for doing this right, but image editing is not a strong suit of mine, and neither is hacking deep inside the OCR engine.
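To make the "identify black spots" part concrete, here is a rough sketch of one possible approach. It is not a working pipeline and not tied to any OCR engine. It assumes the page is already available as a grid of grayscale pixel values (0 = black, 255 = white), that a redaction shows up as a long horizontal run of dark pixels while ordinary letter strokes are short, and that characters have a roughly fixed pixel width. The names `collapse_band`, `find_dark_runs`, and `runs_to_placeholder`, and the threshold values, are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows: 0 = black, 255 = white


def collapse_band(band: Image) -> List[int]:
    """Collapse a band of pixel rows (one text line) into a single row.

    Each column keeps its lightest pixel, so a column only counts as dark
    if it is dark across the whole height of the band.
    """
    return [max(column) for column in zip(*band)]


def find_dark_runs(row: List[int], threshold: int = 50,
                   min_len: int = 20) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges of long dark runs in one row.

    Runs shorter than min_len are ignored, on the assumption that they
    are letter strokes rather than redaction bars.
    """
    runs = []
    start = None
    for x, value in enumerate(row):
        if value < threshold:
            if start is None:
                start = x  # a dark run begins here
        elif start is not None:
            if x - start >= min_len:
                runs.append((start, x))
            start = None
    # Handle a run that extends to the right edge of the row.
    if start is not None and len(row) - start >= min_len:
        runs.append((start, len(row)))
    return runs


def runs_to_placeholder(run: Tuple[int, int], char_width: int = 10,
                        mark: str = "X") -> str:
    """Turn a dark run into a string of marks, one per assumed character width."""
    start, end = run
    count = max(1, round((end - start) / char_width))
    return mark * count


if __name__ == "__main__":
    # Synthetic 5-pixel-high text line: white, a 110-pixel black bar, white.
    band = [[255] * 30 + [0] * 110 + [255] * 20 for _ in range(5)]
    for run in find_dark_runs(collapse_band(band)):
        print(run, runs_to_placeholder(run))
```

With a real scan, the pixel grid would have to come from an imaging library, and the placeholder would still need to be merged into the OCR text at the right position, for example by comparing the run's coordinates with the word positions the OCR engine reports. That part is not covered by this sketch.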