TF-IDF: Do I understand it correctly?

I want to cluster documents, and I am considering using TF-IDF for this.

If I'm not mistaken, TF-IDF is mainly used to evaluate the relevance of a document to a query. If I don't have a specific query, how can I apply TF-IDF to clustering?

+4
3 answers

Not quite: tf-idf gives you the relevance of a word in a document.
So you can use it for clustering by computing a proximity between documents, along the lines of

proximity(document_i, document_j) = sum(tf_idf(t,i) * tf_idf(t,j)) 

summed over each term t that occurs in both doc i and doc j (in other words, the dot product of the two documents' tf-idf vectors).
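
A minimal sketch of that proximity, assuming documents are already tokenized into lists of words and using one common TF-IDF variant (relative term frequency times the log of inverse document frequency). The function names `tf_idf_vectors` and `proximity` and the toy documents are made up for illustration; other TF-IDF weightings would work just as well.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # relative term frequency * log(inverse document frequency)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def proximity(vec_i, vec_j):
    """Sum of tf_idf(t, i) * tf_idf(t, j) over terms present in both documents."""
    shared = vec_i.keys() & vec_j.keys()
    return sum(vec_i[t] * vec_j[t] for t in shared)

# Hypothetical toy corpus: two documents about cats, one about markets.
docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stocks fell as the market closed".split(),
]
vectors = tf_idf_vectors(docs)
cat_vs_cat = proximity(vectors[0], vectors[1])
cat_vs_market = proximity(vectors[0], vectors[2])
```

With this weighting, a word that appears in every document (here "the") gets weight zero, so only the shared, distinctive term "cat" contributes to the proximity of the first two documents.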

+4

For clustering documents, the best approach is to use the k-means algorithm. If you know how many types of documents you have, you know what k is.

Here is how it works on documents:

a) Select k random documents as the initial cluster centers.

b) Assign each document to the cluster whose center is at the minimum distance from the document.

c) After the documents are assigned, make k new cluster centers by taking the centroid of each cluster.

d) Repeat b) and c) until the assignments stop changing.
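
The steps above can be sketched as a short loop. This is only an illustration under some assumptions: the function name `k_means`, the `max_iter` and `seed` parameters, and the choice to pass the distance and centroid computations in as functions are all mine, and the loop repeats the assign/recompute steps until no document changes cluster.

```python
import random

def k_means(docs, k, distance, centroid, max_iter=100, seed=0):
    """Cluster docs into k groups; returns (assignments, centers)."""
    rng = random.Random(seed)
    # a) pick k random documents as the initial cluster centers
    centers = rng.sample(docs, k)
    assignments = None
    for _ in range(max_iter):
        # b) assign each document to the nearest center
        new_assignments = [
            min(range(k), key=lambda c: distance(doc, centers[c]))
            for doc in docs
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # c) replace each center with the centroid of its cluster
        for c in range(k):
            members = [d for d, a in zip(docs, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = centroid(members)
    return assignments, centers

# Toy usage with plain numbers standing in for documents.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
labels, centers = k_means(
    points,
    k=2,
    distance=lambda a, b: abs(a - b),
    centroid=lambda members: sum(members) / len(members),
)
```

For real documents, `docs` would be the TF-IDF vectors, and `distance` and `centroid` would be the document-level versions of those two operations.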

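Now the questions are:

a) How to calculate the distance between two documents: it is based on nothing more than the cosine similarity between the document's term vector and the cluster center (initially one of the randomly chosen documents). The higher the similarity, the closer the document, so 1 - similarity can serve as the distance. The term weights here are simply the TF-IDF values (calculated earlier for each document).

b) How to calculate the centroid: for each term, take the sum of that term's TF-IDF values over the documents in the cluster, divided by the number of documents in the cluster. Do this for every term that occurs in the cluster. This gives you k new n-dimensional "documents" to use as cluster centers.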
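
As a rough sketch of these two calculations, assuming each document is stored as a sparse `{term: tf_idf_weight}` dictionary (the function names and the zero-vector convention below are my own choices, not part of the answer):

```python
import math
from collections import defaultdict

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def cosine_distance(vec_a, vec_b):
    """Turn the similarity into a distance: identical direction gives 0."""
    return 1.0 - cosine_similarity(vec_a, vec_b)

def centroid(vectors):
    """Per-term sum of TF-IDF weights divided by the number of documents."""
    totals = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    return {term: total / len(vectors) for term, total in totals.items()}

# Hypothetical weights, just to show the shapes involved.
cluster = [{"cat": 1.0, "mat": 2.0}, {"cat": 3.0}]
center = centroid(cluster)
```

A term missing from a document counts as weight zero, which is why the centroid divides by the number of documents in the cluster rather than by the number of documents containing the term.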

Hope this helps!

+4

TF-IDF serves a different purpose. Unless you want to reinvent the wheel, you are better off using a tool like Carrot. Googling "document clustering" will turn up plenty of algorithms if you want to implement them yourself.

+1

Source: https://habr.com/ru/post/1305426/

