I have done extensive research and cannot find a combination of techniques that will achieve what I need.
I need to perform OCR on hundreds of W2s in order to extract data for reconciliation. The W2s are of very poor quality, as they are printed and then scanned back into the computer. This process is beyond my control; unfortunately, I have to work with what I have.
I was able to complete this process successfully last year, but I had to brute-force it, since time was a serious constraint. I did this by manually specifying the coordinates of the data to extract, then running OCR on only those segments, one at a time. This year I would like to build a more dynamic solution in anticipation of the coordinates changing, the format changing, etc.
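For reference, the fixed-coordinate approach has roughly the following shape. This is a minimal sketch rather than my actual script: the field names, the coordinates in `FIELD_BOXES`, and the `ocr` callable are all placeholders made up for illustration.

```python
import numpy as np

# Hypothetical field layout: name -> (x, y, width, height) in pixels.
# Real values would have to be measured by hand on the scanned form.
FIELD_BOXES = {
    "wages": (10, 5, 20, 8),
    "federal_tax": (40, 5, 20, 8),
}


def extract_fields(image, boxes, ocr):
    """Crop each fixed box out of the page and run OCR on that crop alone."""
    results = {}
    for name, (x, y, w, h) in boxes.items():
        crop = image[y:y + h, x:x + w]  # NumPy images index as [row, column]
        results[name] = ocr(crop)       # `ocr` is any callable taking an image
    return results
```

The weakness is visible in `FIELD_BOXES`: every coordinate is hard-coded, so any shift in the layout or the scan breaks the extraction.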
I have included a scrubbed sample W2 below. The idea is to treat each box on the W2 as its own rectangle and extract the data by iterating over all the rectangles. I tried several edge detection methods, but none of them delivered exactly what I needed. I believe I have not found the right combination of preprocessing steps. I also tried to adapt some Sudoku puzzle detection approaches.

Below is my current code (Python, OpenCV 2.3):

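import cv2
import numpy as np

# image_path_here is a placeholder for the path to a scanned W2
img = cv2.imread(image_path_here)
# Halve the image size; // keeps the dimensions integers on Python 3 as well
newx, newy = img.shape[1] // 2, img.shape[0] // 2
img = cv2.resize(img, (newx, newy))
blur = cv2.GaussianBlur(img, (3, 3), 5)
ret, thresh1 = cv2.threshold(blur, 225, 255, cv2.THRESH_BINARY)
gray = cv2.cvtColor(thresh1, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 220, apertureSize=3)
minLineLength = 20
maxLineGap = 50
# Pass these by keyword: the fifth positional argument of HoughLinesP is the
# output array `lines`, not minLineLength
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 100,
                        minLineLength=minLineLength, maxLineGap=maxLineGap)
# HoughLinesP returns None when nothing is found; reshape(-1, 4) handles both
# the (1, N, 4) layout of OpenCV 2.x and the (N, 1, 4) layout of 3.x and later
if lines is not None:
    for x1, y1, x2, y2 in lines.reshape(-1, 4):
        cv2.line(img, (x1, y1), (x2, y2), (255, 0, 255), 2)
cv2.imshow('hough', img)
cv2.waitKey(0)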