Separating Background / Foreground Layers in a Scanned Document

I need to automatically remove the softly colored background from a scanned document image before running OCR on it.

ScanTailor is an open-source C++ application that performs this background separation, among other things, but I cannot figure out how to run only the final step, the one that actually removes the background.

Ideally, I could find code that does this, and either:

  • Port that part to C#
  • Modify the C++ code so that, when run from the command line, it performs only this step on a given image

Can you help me understand how to do this?
Or do you know of other libraries that can do it? (Any language or platform is acceptable.)

+1
2 answers

What you are describing is thresholding, despeckling, and noise removal, which are the usual preprocessing steps in OCR applications.

The quality of the results depends on many factors:

  • Original print quality
  • Scan quality
  • Image resolution
  • Background colors and patterns used
  • Noise and other marks

You may find the IEvolution.NET library at http://www.hi-components.com/nievolution.asp useful. It offers many image processing functions.

There are many commercial engines. No single function solves every image processing problem; you have to tune the functions and their parameters to suit your images. See, for example, http://www.recogniform.com/thresholding.htm
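To make "thresholding" concrete, here is a minimal sketch of one common global technique, Otsu's method. Neither the question nor this answer names it, so treat it as an illustration rather than what ScanTailor or the libraries above do. The function names `otsu_threshold` and `binarize` are made up for this example, and the input is assumed to be 8-bit grayscale values.

```python
def otsu_threshold(gray_values):
    """Return the level t (0..255) that best splits the values into a dark
    class (<= t) and a light class (> t) by maximizing between-class variance."""
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0    # pixels at or below the candidate threshold
    sum_dark = 0  # sum of their intensities
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0 or n_light == 0:
            continue  # one class is empty, so there is no split to score
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        var_between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray_rows, threshold):
    """Map every pixel above the threshold to white (255), the rest to black (0)."""
    return [[255 if v > threshold else 0 for v in row] for row in gray_rows]
```

A single global threshold like this tends to work only when the background tint is fairly uniform; uneven lighting usually calls for a locally adaptive variant, which is part of the parameter tuning mentioned above.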

A search on Google will show many results.

+3

The algorithm might be roughly:

  • Determine the background color
  • Scan the bitmap for pixels whose color matches, or is sufficiently similar to, the background color
  • Convert these pixels to white or transparent
  • Possibly (especially if the page contains images, not just text) ignore isolated pixels that have the background color but are not adjacent to other background pixels
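The steps above could be sketched roughly as follows. This is only an illustration under several assumptions of mine: the image is a list of rows of `(r, g, b)` tuples, the background is taken to be the most frequent coarsely quantized color, and "similar" means every channel is within a fixed `tolerance`. The names `estimate_background`, `is_similar`, and `remove_background`, and the default values, are hypothetical.

```python
from collections import Counter


def estimate_background(pixels, step=8):
    """Guess the background as the most common color after coarse quantization."""
    counts = Counter(
        (r // step, g // step, b // step)
        for row in pixels
        for (r, g, b) in row
    )
    qr, qg, qb = counts.most_common(1)[0][0]
    half = step // 2  # use the center of the winning quantization bucket
    return (qr * step + half, qg * step + half, qb * step + half)


def is_similar(color, background, tolerance):
    """True if every channel is within `tolerance` of the background color."""
    return all(abs(c - b) <= tolerance for c, b in zip(color, background))


def remove_background(pixels, tolerance=24, white=(255, 255, 255)):
    """Whiten background-colored pixels that touch another background pixel."""
    background = estimate_background(pixels)
    height, width = len(pixels), len(pixels[0])
    mask = [[is_similar(p, background, tolerance) for p in row] for row in pixels]

    def has_background_neighbor(y, x):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (dy or dx) and 0 <= ny < height and 0 <= nx < width and mask[ny][nx]:
                    return True
        return False

    # Isolated matches (no background neighbor) are left untouched.
    return [
        [
            white if mask[y][x] and has_background_neighbor(y, x) else pixels[y][x]
            for x in range(width)
        ]
        for y in range(height)
    ]
```

In practice the tolerance would need tuning per scan, and an image library would replace the nested lists, but the structure of the four steps stays the same.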

If the image has low color resolution (for example, a high-resolution black-and-white image), you need to apply this algorithm to groups of pixels rather than to individual pixels.
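One possible reading of "groups of pixels" is to average each small tile and decide per tile instead of per pixel. The sketch below does that for a grayscale image stored as a list of rows; the tile size `block`, the cutoff `min_mean`, and both function names are assumptions chosen for illustration, not something the answer specifies.

```python
def block_means(gray_rows, block=4):
    """Average each non-overlapping block x block tile of a grayscale image."""
    height, width = len(gray_rows), len(gray_rows[0])
    means = []
    for y0 in range(0, height, block):
        row = []
        for x0 in range(0, width, block):
            tile = [
                gray_rows[y][x]
                for y in range(y0, min(y0 + block, height))
                for x in range(x0, min(x0 + block, width))
            ]
            row.append(sum(tile) / len(tile))
        means.append(row)
    return means


def whiten_light_blocks(gray_rows, block=4, min_mean=160):
    """Set every tile whose average is at least `min_mean` to pure white."""
    height, width = len(gray_rows), len(gray_rows[0])
    out = [row[:] for row in gray_rows]  # copy so the input is not modified
    for by, mean_row in enumerate(block_means(gray_rows, block)):
        for bx, mean in enumerate(mean_row):
            if mean >= min_mean:
                for y in range(by * block, min((by + 1) * block, height)):
                    for x in range(bx * block, min((bx + 1) * block, width)):
                        out[y][x] = 255
    return out
```

A tile that is mostly white with a few scattered dark dots is treated as tinted background and cleared, while a tile dominated by dark pixels is kept as likely text.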

+1

Source: https://habr.com/ru/post/1333151/