I have black-and-white documents (scanned images) and I want to group them according to their layout. To make it more specific, let's say I have the following three images: the first two should fall into the same cluster, and the third into a different one, because the first two have a relatively similar layout.
My question is: what would be the best approach to clustering documents by layout? So far I have a few initial ideas:
- compute an image hash for each document and compare the hashes
- use PCA and a clustering method (K-means) to compare smaller representations
- extract the text using OCR, compute text features and compare them
- extract the text using OCR and do a keyword search
Are there other, better approaches? Again, only the layout matters.
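To make the first idea concrete, here is a minimal sketch of what I mean by hashing and comparing. It is only an illustration under some assumptions: each page is already binarized into a list of rows with 1 for ink and 0 for background, the pages are tiny synthetic ones rather than real scans, and the names `make_page`, `block_density_hash`, `hamming` and `group_by_layout` are hypothetical helpers I made up, not from any library. The hash marks which cells of a coarse grid are darker than the page average, so it reacts to where the ink is and not to what the text says.

```python
def make_page(size, boxes):
    """Build a synthetic size x size page; each box (y0, y1, x0, x1) is filled with ink (1)."""
    image = [[0] * size for _ in range(size)]
    for y0, y1, x0, x1 in boxes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                image[y][x] = 1
    return image


def block_density_hash(image, grid=4):
    """Return a tuple of bits: 1 where a grid cell holds more ink than the average cell."""
    height, width = len(image), len(image[0])
    densities = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            densities.append(sum(cell) / len(cell))
    mean = sum(densities) / len(densities)
    return tuple(1 if d > mean else 0 for d in densities)


def hamming(a, b):
    """Number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def group_by_layout(hashes, max_distance):
    """Greedy grouping: a page joins the first group whose first member is close enough."""
    groups = []
    for i, h in enumerate(hashes):
        for group in groups:
            if hamming(hashes[group[0]], h) <= max_distance:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Two pages with horizontal bands (slightly different margins) and one with a left column.
page_a = make_page(16, [(0, 4, 0, 16), (8, 12, 0, 16)])
page_b = make_page(16, [(0, 4, 1, 15), (8, 12, 1, 15)])
page_c = make_page(16, [(0, 16, 0, 8)])

hashes = [block_density_hash(p) for p in (page_a, page_b, page_c)]
groups = group_by_layout(hashes, max_distance=2)
print(groups)  # [[0, 1], [2]]
```

The grid size and `max_distance` are arbitrary choices for this toy data. On real scans I would expect to need deskewing and size normalization first, and the greedy grouping is just a stand-in for a proper clustering step.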


