By average term frequency I mean the frequency of the term averaged over the documents in which it appears. Intuitively, I want to compare how often a term appears in this document relative to how often it appears in the other documents of the corpus.
Example:
- d1 contains the word "set" 100 times, d2 contains it 1 time, d3 contains it 1 time, and d4–dN do not contain it at all
- d1 contains the word "theory" 100 times, d2 contains it 100 times, d3 contains it 100 times, and d4–dN do not contain it at all
Document 1 gets the same tf-idf score for "set" and for "theory", even though "set" is intuitively more characteristic of d1 than "theory" is.
Using the average term frequency, these two cases could be distinguished (see the sketch below). Is tf-iatf (inverse average term frequency) a valid approach? To me it seems it would surface genuinely important keywords rather than merely "rare" or "unique" ones. If idf is an estimate of "how rare this word is", wouldn't iatf be a better estimate? It seems only slightly harder to implement (especially if the data is already preprocessed).
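For concreteness, here is a minimal sketch of both measures on the toy corpus above. The tf-iatf definition used here (tf divided by the term's average frequency over the documents containing it) is my assumption of what "inverse average term frequency" would mean, and the corpus size N = 10 is arbitrary:

```python
import math

# Toy corpus from the example: per-document counts for each term.
# Documents d4..dN contain neither word; assume N = 10 total documents.
N = 10
counts = {
    "set":    {"d1": 100, "d2": 1,   "d3": 1},
    "theory": {"d1": 100, "d2": 100, "d3": 100},
}

def tf_idf(term, doc):
    tf = counts[term].get(doc, 0)
    df = len(counts[term])        # number of documents containing the term
    return tf * math.log(N / df)

def tf_iatf(term, doc):
    tf = counts[term].get(doc, 0)
    docs = counts[term]
    # Assumed definition: average frequency of the term over the
    # documents that actually contain it.
    atf = sum(docs.values()) / len(docs)
    return tf / atf

for term in ("set", "theory"):
    print(term,
          "tf-idf:", round(tf_idf(term, "d1"), 3),
          "tf-iatf:", round(tf_iatf(term, "d1"), 3))
```

Under this sketch, tf-idf gives d1 the identical score 100 * log(10/3) for both words, while tf-iatf gives "set" 100/34 ≈ 2.94 and "theory" 100/100 = 1.0, so the two cases are separated exactly as I hoped.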
I'm going to run an experiment and manually inspect the top-ranked keywords under each measure, but first I wanted to get a second pair of eyes on the idea.
Follow-up question: why is tf-idf used so widely rather than alternative measures like this one, which MAY be more accurate (assuming this is a valid approach)?