How to split a scanned page into individual words using Python?

Question

Is it possible to cut a scanned image of text into several images, each containing one word? That is, if the page contains n words, the script should create n separate images.

(using python)

+4

python image-processing python-imaging-library

Arackna Mar 09 '11 at 19:25

2 answers

You need to look at blob detection, which is an image processing technique. Also, this question is not really specific to Python, but searching for Python blob detection libraries may help.
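
As a rough illustration of the idea, here is a minimal pure-Python sketch of one simple form of blob detection: connected-component labelling by flood fill, followed by merging of neighbouring letter blobs into words. The names find_blobs and merge_into_words, the max_gap parameter, and the 0/1 grid representation (1 = ink, as you might get from thresholding the scan) are all assumptions made for this example, not part of any particular library.

```python
from collections import deque


def find_blobs(grid):
    """Return (left, top, right, bottom) boxes of 8-connected ink regions.

    grid is a list of rows; 1 means ink, 0 means background.
    right and bottom are exclusive.
    """
    height = len(grid)
    width = len(grid[0]) if grid else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not grid[y0][x0] or seen[y0][x0]:
                continue
            # breadth-first flood fill from this unvisited ink pixel
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            left, top, right, bottom = x0, y0, x0, y0
            while queue:
                x, y = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right + 1, bottom + 1))
    return boxes


def merge_into_words(boxes, max_gap=2):
    """Merge boxes that overlap vertically and are at most max_gap columns apart."""
    words = []
    for box in sorted(boxes):  # left to right
        for i, (l, t, r, b) in enumerate(words):
            same_line = box[1] < b and box[3] > t
            if same_line and box[0] - r <= max_gap:
                words[i] = (l, min(t, box[1]), max(r, box[2]), max(b, box[3]))
                break
        else:
            words.append(box)
    return words
```

Each blob here is usually a single letter (or part of one), so the second step is what turns letters into words; a sensible max_gap depends on the font size and scan resolution and would need tuning.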

+2

cmaynard Mar 09 '11 at 19:29

Chris farmiloe · Accepted Answer · 2011-03-09T22:31:16+0000

This is not an area I am very familiar with, but assuming you cannot use OCR (because your text is illegible or something), I would (probably naively) try something like:

Load the image data into memory.
Split the pixel data into rows.
Find each row that contains only white pixels and record it as a "white row".
For each band of text between two white rows, scan the columns for all-white gaps.
Take the resulting x, y coordinates and crop the image.

It actually sounded like a fun exercise, so I gave it a go with pyPNG:

import sys

import png  # pyPNG

# white gaps narrower than this many pixels are treated as letter spacing
KERNING = 3


def find_rows(pixels, width, height):
    "find the first row of every run of purely white rows"
    white_rows = []
    is_white = False
    for y in range(height):
        if sum(sum(pixels[(y*4*width)+x*4+p] for p in range(3))
               for x in range(width)) >= width*3*254:
            if not is_white:
                white_rows.append(y)
                is_white = True
        else:
            is_white = False
    return white_rows


def find_words_in_image(blob, tolerance=30):
    n = 0
    r = png.Reader(bytes=blob)
    (width, height, pixels_rows, meta) = r.asRGBA8()
    # flatten the rows into one long list of R, G, B, A values
    pixels = []
    for row in pixels_rows:
        for px in row:
            pixels.append(px)
    # find each horizontal line
    white_rows = find_rows(pixels, width, height)
    # for each line try to find a white vertical gap
    for i, y in enumerate(white_rows):
        if i >= len(white_rows)-1:
            continue  # no following white row to close this band
        y2 = white_rows[i+1]
        height_of_row = y2 - y
        is_white = False
        white_cols = []
        last_black = -100
        for x in range(width-4):
            s = y*4*width+x*4
            if sum(pixels[s+y3*4*width] + pixels[s+y3*4*width+1] + pixels[s+y3*4*width+2]
                   for y3 in range(height_of_row)) >= height_of_row*3*240:
                if not is_white:
                    if len(white_cols) > 0 and x-last_black < KERNING:
                        continue
                    white_cols.append(x)
                    is_white = True
            else:
                is_white = False
                last_black = x
        # now we have a list of x,y co-ords for all the words on this row
        for j, x in enumerate(white_cols):
            if j >= len(white_cols)-1:
                continue
            wordpx = []
            new_width = white_cols[j+1]-x
            new_height = y2-y
            x_offset = x*4
            for h in range(new_height):
                y_offset = (y+h)*4*width
                start = x_offset+y_offset
                wordpx.append(pixels[start:start+(new_width*4)])
            n += 1
            # PNG data is binary, so the file must be opened in 'wb' mode
            with open('word%s.png' % n, 'wb') as f:
                w = png.Writer(width=new_width, height=new_height, alpha=True)
                w.write(f, wordpx)
    return n


if __name__ == "__main__":
    #
    # USAGE: python png2words.py yourpic.png
    #
    # OUTPUT: [word1.png...word2.png...wordN.png]
    #
    with open(sys.argv[1], 'rb') as source:
        n = find_words_in_image(source.read())
    print("found %s words" % n)
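
To experiment with the row-then-column gap search without a PNG file, the same idea can be sketched on a small in-memory grid. This is only a toy version under stated assumptions: the grid is a list of rows with 1 for ink and 0 for white, and the names runs, word_boxes and min_gap are made up for this example.

```python
def runs(flags):
    """Return (start, end) pairs, end exclusive, for each run of true values."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def word_boxes(grid, min_gap=2):
    """Return (left, top, right, bottom) word boxes; right and bottom are exclusive."""
    boxes = []
    width = len(grid[0]) if grid else 0
    # consecutive rows that contain any ink form one line of text
    for top, bottom in runs([any(row) for row in grid]):
        # columns that contain any ink within this line of text
        inked = [any(grid[y][x] for y in range(top, bottom)) for x in range(width)]
        merged = []
        for left, right in runs(inked):
            if merged and left - merged[-1][1] < min_gap:
                # gap too narrow to be a word break: extend the previous word
                merged[-1] = (merged[-1][0], right)
            else:
                merged.append((left, right))
        boxes.extend((left, top, right, bottom) for left, right in merged)
    return boxes
```

The boxes use the (left, upper, right, lower) ordering with exclusive right and lower edges, so they could be passed to a cropping routine such as Pillow's Image.crop once the grid has been derived from a real image.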
