Tf-idf: Do I understand correctly?

Question

I want to cluster documents, and I am considering using TF-IDF for this.

If I'm not mistaken, TF-IDF is mainly used to evaluate the relevance of a document to a query. If I don't have a specific query, how can I apply TF-IDF to clustering?

+4

language-agnostic algorithm text-processing information-retrieval tf-idf

alskndalsnd Mar 29 '10 at 7:08

3 answers

Pierroz · Answer 1 · 2010-03-29T07:40:54+0000

Not quite: tf-idf gives you the relevance of a term within a document.
So you can use it for clustering by computing a proximity between two documents, for example

proximity(document_i, document_j) = sum(tf_idf(t,i) * tf_idf(t,j))

summed over every term t that occurs in both doc i and doc j.
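
As a rough illustration of that formula, here is a minimal Python sketch. It assumes documents are already tokenized into lists of words and uses one common TF-IDF variant (relative term frequency times log(N / document frequency)); the function names and the toy documents are made up for the example and are not from the answer.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def proximity(vec_i, vec_j):
    """Sum of tf_idf(t, i) * tf_idf(t, j) over terms t found in both documents."""
    shared = vec_i.keys() & vec_j.keys()
    return sum(vec_i[t] * vec_j[t] for t in shared)

# Toy corpus (hypothetical): two documents about a cat, one about markets.
docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stocks fell as markets closed".split(),
]
vectors = tf_idf_vectors(docs)
print(proximity(vectors[0], vectors[1]))  # positive: the documents share terms
print(proximity(vectors[0], vectors[2]))  # 0: no terms in common
```

Note that this raw sum favors long documents; dividing by the vector lengths turns it into a cosine similarity.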

Kapil d · Answer 2 · 2010-04-02T00:47:05+0000

For clustering documents, a standard approach is the k-means algorithm. If you know how many types of documents you have, you know what k is.

To apply it to documents:

a) Select k random documents as the initial cluster centers.

b) Assign each document to the cluster whose center is at the minimum distance from it.

c) Once all documents are assigned, build k new cluster centers by taking the centroid of each cluster, then repeat from step b until the assignments stop changing.

Now the questions are:

a) How to calculate the distance between a document and a cluster center: use the cosine similarity between their term vectors. The term weights here are simply the TF-IDF values (calculated earlier for each document).

b) How to compute the centroid: for each term, sum its TF-IDF weight over the documents in the cluster and divide by the number of documents in the cluster. Do this for every term that occurs in the cluster. This gives you k new n-dimensional "documents", one per cluster.
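
The steps and the two formulas above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: each document is already a sparse {term: TF-IDF weight} dict, the weights in the toy data are invented, and the `start` parameter (indices of the initial documents) is a hypothetical addition to make the run reproducible.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(members):
    """Per-term sum of TF-IDF weights divided by the number of documents."""
    total = {}
    for vec in members:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(members) for term, weight in total.items()}

def kmeans(vectors, k, start=None, max_iter=20):
    """Cluster sparse TF-IDF vectors; returns one cluster label per document."""
    # Step a: pick k documents as the initial centers (random unless given).
    if start is None:
        start = random.sample(range(len(vectors)), k)
    centroids = [dict(vectors[i]) for i in start]
    labels = None
    for _ in range(max_iter):
        # Step b: assign each document to the most similar center.
        new_labels = [
            max(range(k), key=lambda c: cosine(vec, centroids[c]))
            for vec in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Step c: replace each center with the centroid of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centroids[c] = centroid(members)
    return labels

# Toy TF-IDF vectors with made-up weights: two "animal" and two "finance" documents.
vectors = [
    {"cat": 0.5, "mat": 0.3},
    {"cat": 0.4, "mouse": 0.6},
    {"stock": 0.7, "market": 0.2},
    {"stock": 0.3, "bank": 0.5},
]
labels = kmeans(vectors, k=2, start=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

With random starting documents the result can depend on the initial pick, so in practice people usually run it several times and keep the best clustering.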

Hope this helps!

Mikos · Answer 3 · 2010-04-02T00:57:47+0000

TF-IDF serves a different purpose; unless you want to reinvent the wheel, you are better off using a tool like Carrot. Googling "document clustering" will turn up plenty of algorithms if you want to implement one yourself.
