For clustering documents, the best approach is to use the k-means algorithm. If you know how many types of documents you have, you know what k is.
Here is how it works on documents:
a) Select k random documents as the initial cluster centers.
b) Assign each document to the cluster whose center is at the minimum distance from the document.
c) After the documents are assigned, make k new documents to serve as the cluster centers by taking the centroid of each cluster. Repeat b) and c) until the assignments stop changing.
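The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the function name `kmeans`, its parameters, and the helper functions are all made up for this example, and the distance and centroid are passed in as plain functions so the loop itself stays generic. The Euclidean distance and the toy points are placeholders, just to show the loop running.

```python
import random

def kmeans(docs, k, distance, centroid, max_iters=100, seed=0):
    """Cluster docs into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    # a) pick k random documents as the initial centers
    centers = rng.sample(docs, k)
    labels = None
    for _ in range(max_iters):
        # b) assign each document to the nearest center
        new_labels = [
            min(range(k), key=lambda c: distance(doc, centers[c]))
            for doc in docs
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # c) replace each center with the centroid of its members
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = centroid(members)
    return centers, labels

# Placeholder distance and centroid (hypothetical helpers for this demo)
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

points = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
centers, labels = kmeans(points, 2, euclidean, mean_vector)
```

Keep in mind that k-means depends on the random starting documents, so on real data it is common to run it a few times with different seeds and keep the best result.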
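Now the questions are:
a) How to calculate the distance between two documents: it's nothing more than the cosine similarity between the terms of the document and those of the cluster center. Since a higher similarity means a closer document, use 1 - similarity as the distance (or simply pick the cluster with the highest similarity). The term values here are nothing more than the TF-IDF weights (calculated earlier for each document).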
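b) The centroid should be: for each term, the sum of its TF-IDF values over the documents in the cluster / no. of documents in the cluster. Do this for every term that appears in the cluster. This will give you the new n-dimensional documents that act as cluster centers.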
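To make the two calculations concrete, here is a small sketch. It assumes each document is already a dict mapping a term to its TF-IDF weight; the function names and the weights below are invented for illustration, and returning 0 for an empty vector is just one possible convention.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse TF-IDF vectors (dict: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Turn the similarity into a distance: 0 means same direction."""
    return 1.0 - cosine_similarity(a, b)

def centroid(docs):
    """Per-term sum of TF-IDF weights divided by the number of documents."""
    totals = {}
    for doc in docs:
        for term, w in doc.items():
            totals[term] = totals.get(term, 0.0) + w
    n = len(docs)
    return {term: total / n for term, total in totals.items()}

# Made-up TF-IDF weights, purely for illustration
d1 = {"cat": 0.5, "dog": 0.5}
d2 = {"cat": 1.0}
d3 = {"stock": 0.8, "market": 0.6}

center = centroid([d1, d2])  # {"cat": 0.75, "dog": 0.25}
near = cosine_distance(d1, center)
far = cosine_distance(d3, center)  # no shared terms, so the distance is 1.0
```

A term that is missing from a document simply counts as 0, which is why the centroid divides by the number of documents in the cluster and not by the number of documents containing the term.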
Hope this helps!