Black-and-white document image clustering

I have black-and-white documents (scanned images) and I want to group them according to their layout. To be more specific, say I have the following three images: the first two should be more likely to fall into the same cluster with each other than with the third, because the first two have a relatively similar layout.

My question is: what would be the best approach to clustering these documents? So far I have a few initial ideas:

  • compute an image hash for each document and compare the hashes
  • use PCA and a clustering method (k-means) to compare smaller representations
  • extract text lines using OCR, compute text features, and compare them
  • extract text lines using OCR and do a keyword search

Are there other, better approaches? Again, only the layout matters.
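
To make the first idea concrete, here is a minimal sketch of hash comparison, assuming each page is already loaded as a 2-D NumPy array in which nonzero pixels are ink. The names `average_hash` and `hamming`, the 8x8 grid, and the synthetic pages are illustrative choices only, not an established recipe for this problem.

```python
import numpy as np

def average_hash(img, grid=8):
    """Return grid*grid booleans: True where a block holds more ink than the average block."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    # Crop so the image divides evenly into grid x grid blocks.
    h_crop, w_crop = h - h % grid, w - w % grid
    blocks = img[:h_crop, :w_crop].reshape(grid, h_crop // grid, grid, w_crop // grid)
    means = blocks.mean(axis=(1, 3))
    return (means > means.mean()).ravel()

def hamming(a, b):
    """Number of hash bits that differ between two hashes."""
    return int(np.count_nonzero(a != b))

# Synthetic 64x64 pages (assumed data): two with a header band, one with a left column.
page_a = np.zeros((64, 64)); page_a[4:12, :] = 1
page_b = np.zeros((64, 64)); page_b[5:13, :] = 1   # same layout, shifted by one pixel
page_c = np.zeros((64, 64)); page_c[:, 4:12] = 1   # different layout

d_ab = hamming(average_hash(page_a), average_hash(page_b))
d_ac = hamming(average_hash(page_a), average_hash(page_c))
```

With these toy pages, `d_ab` comes out smaller than `d_ac`, which is the behaviour I would want. Whether such a coarse hash separates real scans is exactly what I am unsure about.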

1st image

2nd image

3rd image

1 answer

Do not try to cluster the raw data.

Clustering is unsupervised; it cannot learn which properties are important and which are not. To a clustering algorithm, everything is important.

Instead, first define appropriate features, such as long edges.
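
As a rough sketch of what such a feature could look like, the code below uses long runs of ink pixels (ruling lines, box borders) as a stand-in for "long edges" and records where on the page they occur. It assumes a binarized page as a 2-D NumPy array with nonzero ink. The function names, the 8 bins, and the 50% length threshold are hypothetical choices to tune, not part of the advice above.

```python
import numpy as np

def longest_run(line):
    """Length of the longest run of consecutive nonzero pixels in a 1-D array."""
    best = current = 0
    for pixel in line:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best

def long_edge_features(page, bins=8, min_fraction=0.5):
    """Binned positions of rows and columns that contain a long ink run."""
    page = np.asarray(page) > 0
    h, w = page.shape
    # A row counts as a long horizontal edge if one run spans min_fraction of the width.
    rows = np.array([longest_run(r) >= min_fraction * w for r in page])
    cols = np.array([longest_run(c) >= min_fraction * h for c in page.T])
    # Bin relative positions so that small shifts between scans fall into the same bin.
    row_hist = np.histogram(np.flatnonzero(rows) / h, bins=bins, range=(0, 1))[0]
    col_hist = np.histogram(np.flatnonzero(cols) / w, bins=bins, range=(0, 1))[0]
    return np.concatenate([row_hist / h, col_hist / w])

# Synthetic 64x64 pages (assumed data): two similar forms and one different page.
form_a = np.zeros((64, 64)); form_a[10, 4:60] = 1; form_a[10:60, 20] = 1
form_b = np.zeros((64, 64)); form_b[11, 4:60] = 1; form_b[11:60, 21] = 1
letter_c = np.zeros((64, 64)); letter_c[50, 4:60] = 1

fa, fb, fc = (long_edge_features(p) for p in (form_a, form_b, letter_c))
dist_ab = float(np.linalg.norm(fa - fb))
dist_ac = float(np.linalg.norm(fa - fc))
```

Feature vectors like these can then be passed to whatever clustering method you prefer. The point is that the distance now reflects only the properties you chose, so the two forms end up closer to each other than to the third page.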


Source: https://habr.com/ru/post/1273617/

