Align text for OCR

I am creating a database of historical records that I have as photographed pages from books (more than 100 thousand pages). I wrote some Python code to do some image processing before I OCR each page. Since the data in these books does not come in well-formatted tables, I need to segment each page into rows and columns, and then OCR each piece separately.

One important step is to align the text in the image.

For example, this is a typical page that should be aligned:

[image: page to be aligned]

The solution I found is to blur the text horizontally (I use skimage.ndimage.morphology.binary_dilation) and find the rotation that maximizes the sum of the white pixels along the horizontal dimension.
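For readers who want something concrete, here is a minimal sketch of a projection-profile search of this kind. It is not the code from the notebook: it skips the dilation step, it projects the coordinates of the ink pixels instead of rotating the whole image, and it scores each angle by the sum of squared row counts. The function name `estimate_skew`, the search range and the scoring rule are all assumptions made for illustration.

```python
import numpy as np

def estimate_skew(binary, angles=None):
    """Return the candidate angle (degrees) that packs ink pixels into the fewest rows.

    binary: 2-D boolean array, True where there is ink (white after thresholding).
    A result of a means the text lines follow row = row0 + col * tan(a).
    """
    if angles is None:
        # Assumed search range: -5 to +5 degrees in 0.25 degree steps.
        angles = np.linspace(-5.0, 5.0, 41)
    rows, cols = np.nonzero(binary)
    diag = float(np.hypot(*binary.shape))
    n_bins = int(np.ceil(2 * diag))  # roughly one histogram bin per pixel row
    scores = []
    for angle in angles:
        theta = np.deg2rad(angle)
        # Row coordinate each ink pixel would have after undoing a skew of `angle`.
        new_rows = rows * np.cos(theta) - cols * np.sin(theta)
        counts, _ = np.histogram(new_rows, bins=n_bins, range=(-diag, diag))
        # The sum of squares is largest when the ink concentrates in few rows.
        scores.append(np.sum(counts.astype(np.float64) ** 2))
    return float(angles[int(np.argmax(scores))])
```

Because only the ink pixels are touched for each candidate angle, a search like this can be cheaper than rotating the full image at every step, although whether it is fast enough for this volume of pages would need to be measured.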

This works fine, but it takes about 8 seconds per page, which is too much given the volume of pages I'm working with.

Do you know of a better, faster way to align the text?

Update:

I use scikit-image for the image processing and scipy functions to maximize the number of white pixels along the horizontal axis.

Here is a link to the HTML view of the Jupyter notebook I was working on. The code uses some functions from a module that I wrote for this project, so it cannot be run on its own.

Notebook link (dropbox): https://db.tt/Mls9Tk8s

Update 2:

Here is the link to the original source image (dropbox): https://db.tt/1t9kAt0z

+5
3 answers

Preface: I have not done much image processing in Python. I can suggest an image processing approach, but you will have to implement it in Python yourself. All you need is an FFT and a polar transform (I think OpenCV has a built-in function for this), so it should be simple.

You only posted one sample image, so I don't know whether this works for your other images, but the Fourier transform can be very useful for this one: just pad the image to a convenient power-of-two size (for example, 2048x2048), and you get a Fourier spectrum like this:

[image: Fourier spectrum of the padded page]

I posted an intuitive explanation of the Fourier transform here, but in short: your image can be represented as a series of sine/cosine waves, and most of these "waves" are parallel or perpendicular to the orientation of the document. That's why you see a strong frequency response at about 0°, 90°, 180° and 270°. To measure the exact angle, you can take the polar transform of the Fourier spectrum:

[image: polar transform of the Fourier spectrum]

and just collapse each column to a single value (for example, its mean):

[image: plot of the per-column values against angle]

The peak position in this diagram is 90.835°, and if I rotate the image by -90.835 modulo 90, the orientation looks decent:

[image: rotated page]

As I said, I don't have any other test images, but it works for rotated versions of your image. At the very least, this should narrow the search space for a more expensive search method.

Note 1: The FFT is fast, but larger images take longer. And, unfortunately, the best way to get a better angular resolution is to use a larger input image (i.e. with more white padding around the original image).

Note 2: The FFT actually returns an image where the "DC" component (the center in the spectrum image above) is at the origin 0/0. But the rotation property is clearer if you shift it to the center, and this also simplifies the polar transform, so I only showed the shifted version.
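The steps above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: it uses NumPy alone and samples the spectrum along rays by hand instead of calling a library polar transform, it removes the mean and zero-pads rather than padding with white, and it sums the magnitudes along each ray to get one value per angle. The function name `fft_skew_angle` and the parameters `size`, `n_angles` and `r_min` are made up for the example.

```python
import numpy as np

def fft_skew_angle(gray, size=512, n_angles=720, r_min=8):
    """Estimate the text skew in degrees, in [-45, 45), from the Fourier spectrum.

    gray: 2-D float array no larger than size x size.
    A result of a means the text lines follow row = row0 + col * tan(a).
    """
    h, w = gray.shape
    padded = np.zeros((size, size), dtype=np.float64)
    # Remove the mean so the DC term does not dominate the spectrum.
    padded[:h, :w] = gray - gray.mean()
    # Shift the DC component from the origin to the center of the array.
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(padded)))

    # Hand-rolled polar sampling: one ray per angle in [0, 180) degrees.
    center = size // 2
    radii = np.arange(r_min, center - 1)
    thetas = np.arange(n_angles) * (180.0 / n_angles)
    t = np.deg2rad(thetas)[:, None]
    rows = np.round(center + radii[None, :] * np.sin(t)).astype(int)
    cols = np.round(center + radii[None, :] * np.cos(t)).astype(int)
    polar = spectrum[rows, cols]  # shape: (n_angles, number of radii)

    # One value per angle: total spectral magnitude along the ray.
    profile = polar.sum(axis=1)
    peak = thetas[int(np.argmax(profile))]
    # Fold the peak onto the nearest multiple of 90 degrees.
    return float((peak + 45.0) % 90.0 - 45.0)
```

With a 512x512 spectrum and nearest-neighbour sampling, the estimate is coarse (on the order of a degree), which fits Note 1: a finer angle needs a larger padded image. The result could serve as a starting point for a narrower, more expensive search.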

+11

This is not a complete solution, but there are more thoughts here than would fit in a comment.

You have a margin on the left, right, top and bottom of your image. If you remove it, even cutting into the text in the process, you will still have enough information to align the image. So, if you crop, say, 15% off the top, bottom, left and right, you reduce the image area by about 50%, which will speed things up further down the line.

Now take the remaining central area and divide it into, say, 10 strips of equal height that span the full width of the page. Calculate the mean brightness of each strip and take the 1-4 darkest, because they contain the most (black) lettering. Then work on each of those in parallel, or just on the darkest one. You are now processing only the most interesting 5-20% of the page.

Here's the command to do it in ImageMagick - it's just my weapon of choice, and you can do it just as well in Python.

    convert scan.jpg -crop 300x433+64+92 -crop x10@ -format "%[fx:mean]\n" info:
    0.899779
    0.894842
    0.967889
    0.919405
    0.912941
    0.89933
    0.883133    <--- choose 4th last because it is darkest
    0.889992
    0.88894
    0.888865

If I make separate images from these 10 strips, I get this:

 convert scan.jpg -crop 300x433+64+92 -crop x10@ m-.jpg 

[image: the 10 strips saved as separate images]

and effectively, I am aligning on the fourth-from-last image rather than on the whole image.

It may be unscientific, but it is a fairly effective and simple thing to try.
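Since the question is about Python, the crop-and-pick-strips idea might look something like this with NumPy. It is a sketch, not a translation of the ImageMagick command: the function name `darkest_strips` and its parameters are invented for the example, and the image is assumed to be a 2-D grayscale array in which darker pixels have lower values.

```python
import numpy as np

def darkest_strips(gray, margin=0.15, n_strips=10, keep=1):
    """Crop the margins, cut the rest into horizontal strips, return the darkest.

    Returns (indices, strips): the strip indices, darkest first, and the
    matching sub-arrays of the cropped image.
    """
    h, w = gray.shape
    dy, dx = int(h * margin), int(w * margin)
    # Drop the same fraction from the top, bottom, left and right.
    core = gray[dy:h - dy, dx:w - dx]
    # Split into strips of (nearly) equal height spanning the full width.
    strips = np.array_split(core, n_strips, axis=0)
    means = np.array([s.mean() for s in strips])
    # Lowest mean brightness first, i.e. the strips with the most ink.
    darkest = np.argsort(means)[:keep]
    return [int(i) for i in darkest], [strips[i] for i in darkest]
```

The alignment routine would then be run on the returned strips only, and the angle it finds applied to the full page.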

Another thought: once you have your procedure/script sorted out for straightening a single image, don't forget that you can often get a massive speedup by using GNU Parallel to keep all your lovely, expensive CPU cores busy at the same time. Here I specify 8 processes to run in parallel...

    #!/bin/bash
    # Print one command per page; GNU Parallel reads them and runs 8 at a time.
    for ((i=0;i<100000;i++)); do
        echo ProcessPage $i
    done | parallel --eta -j 8
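If you would rather stay inside Python, the standard library's concurrent.futures module can spread the pages over several worker processes in a similar way. The sketch below is only an outline: `process_page` is a hypothetical stand-in for whatever per-page routine you end up with, and `process_all` and its parameters are names made up for the example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def process_page(page_number):
    """Hypothetical placeholder for the real per-page deskew routine."""
    skew_angle = 0.0  # the real routine would load the page and measure this
    return page_number, skew_angle

def process_all(page_numbers, workers=8, executor_cls=ProcessPoolExecutor):
    """Run process_page over all pages with a pool of workers, keeping order."""
    with executor_cls(max_workers=workers) as pool:
        # chunksize batches the work items to cut inter-process overhead;
        # thread pools accept the argument and ignore it.
        return list(pool.map(process_page, page_numbers, chunksize=16))
```

When run as a script with the default process pool, the call to `process_all(range(100000))` should sit under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module. Passing `executor_cls=ThreadPoolExecutor` is handy for a quick check but will not give CPU parallelism for pure Python work.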
+3

"Align text in the image": I assume this means aligning the image so that the text lines have the same baseline.

I really enjoyed reading the scientific answers to this rather difficult task. The answers are great, but is it really necessary to spend so much time (a very valuable resource) on implementing this? Many tools are available for this job, without the need to write a single line of code (unless the OP is a CS student who wants to practice the science, but the OP is obviously doing this out of necessity, to get all the images processed). These methods took me back to my college years, but today I would use different tools to process this batch quickly and efficiently, which is what I do daily. I work for a high-volume document processing and data mining bureau and an OCR consulting company.

Here is the result of a basic open-and-align step in the OCR package ABBYY FineReader. The result was more than sufficient for further OCR processing.

[image: page after alignment in ABBYY FineReader]

And I did not need to recreate and program my own browser in order to post this answer.

0

Source: https://habr.com/ru/post/1235902/

