Clustering black and white document images

Question

I have black and white documents (scanned images) and I want to group them according to their layout. To make it more specific, let's say I have the following three images: the first two are more likely to fall into the same cluster than either is with the third, because the first two have a relatively similar layout.

My question is: what would be the best approach to clustering these documents? So far I have a few initial ideas:

compute an image hash for each document and compare the hashes
use PCA and a clustering method (k-means) to compare the reduced representations
extract strings using OCR, extract text features and compare them
extract lines using OCR and do a keyword search
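
As a rough illustration of the first idea, here is a minimal sketch of an average-style hash compared by Hamming distance. It is written in plain Python on nested lists so it runs without any image library; the function names `average_hash` and `hamming`, the 8x8 grid, and the synthetic pages are all assumptions made for this example, not part of the question.

```python
def average_hash(image, grid=8):
    """Return grid*grid bits describing the coarse ink distribution.

    `image` is a list of equal-length rows of pixel intensities
    (0 = black ink, 255 = white paper) and must be at least `grid`
    pixels wide and tall.
    """
    height, width = len(image), len(image[0])
    cells = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # A bit is 1 where the cell is darker than the page average.
    return tuple(1 if c < mean else 0 for c in cells)


def hamming(a, b):
    """Number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def page_with_box(top, bottom, left, right, size=64):
    """Hypothetical test page: white background with one black rectangle."""
    return [[0 if top <= y < bottom and left <= x < right else 255
             for x in range(size)] for y in range(size)]


header_a = page_with_box(4, 12, 4, 60)   # header bar at the top
header_b = page_with_box(5, 13, 4, 60)   # same layout, shifted by one pixel
sidebar = page_with_box(4, 60, 4, 12)    # bar along the left edge instead

d_similar = hamming(average_hash(header_a), average_hash(header_b))
d_different = hamming(average_hash(header_a), average_hash(sidebar))
print(d_similar, d_different)
```

A smaller distance suggests a more similar coarse layout. On real scans the hash would also react to text density and scan noise, so it is probably only a first filter.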

Are there other, better approaches? Again, only the layout matters.

python opencv machine-learning computer-vision cluster-analysis

PSNR Nov 23 '17 at 19:51

1 answer

Anony-mousse · Answer 1 · 2017-11-24T00:55:33+0000

Do not attempt to cluster the raw data.

Clustering is unsupervised; it cannot find out which features are important and which are not. For the clustering algorithm, everything is important.

Instead, first define appropriate features, such as long edges.
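
One possible reading of "long edges" is sketched below, assuming the page is already binarized into a NumPy boolean array (True = ink). The function names, the 8-band profile, and the 50% run-length threshold are illustrative assumptions, not something the answer specifies. The resulting vectors could then be passed to any clustering algorithm, such as k-means.

```python
import numpy as np


def longest_run(line):
    """Length of the longest unbroken run of True values in a 1-D array."""
    best = current = 0
    for value in line:
        current = current + 1 if value else 0
        best = max(best, current)
    return best


def long_edge_features(ink, bands=8, min_fraction=0.5):
    """Describe where long horizontal and vertical strokes sit on the page.

    A row (or column) counts as a long edge if it contains an ink run
    covering at least `min_fraction` of the page width (or height).
    Both parameters are assumed values, not tuned ones.
    """
    height, width = ink.shape
    rows = np.array([longest_run(r) >= min_fraction * width for r in ink])
    cols = np.array([longest_run(c) >= min_fraction * height for c in ink.T])
    # Share of long-edge rows/columns per band, so page size drops out.
    row_profile = [band.mean() for band in np.array_split(rows, bands)]
    col_profile = [band.mean() for band in np.array_split(cols, bands)]
    return np.array(row_profile + col_profile)


# Hypothetical 64x64 pages: two forms with horizontal rules, one with vertical rules.
form_a = np.zeros((64, 64), dtype=bool)
form_a[10, 5:60] = True
form_a[30, 5:60] = True

form_b = np.zeros((64, 64), dtype=bool)
form_b[11, 5:60] = True
form_b[31, 5:60] = True
form_b[40:44, 5:15] = True   # a short text-like blob that should be ignored

form_c = np.zeros((64, 64), dtype=bool)
form_c[5:60, 10] = True
form_c[5:60, 30] = True

f_a, f_b, f_c = (long_edge_features(p) for p in (form_a, form_b, form_c))
dist_ab = np.linalg.norm(f_a - f_b)
dist_ac = np.linalg.norm(f_a - f_c)
print(dist_ab, dist_ac)
```

Because short runs such as text are discarded before any distance is computed, the clustering step only sees the layout information that was chosen to matter.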
