By average term frequency I mean the frequency of the term averaged over the documents in which it appears. Intuitively, I want to compare how often a term appears in this document relative to how often it appears in the other documents of the corpus.
Example:
- d1 contains the word "set" 100 times, d2 contains it 1 time, d3 contains it 1 time, and d4–dN do not contain it at all
- d1 contains the word "theory" 100 times, d2 contains it 100 times, d3 contains it 100 times, and d4–dN do not contain it at all
Document 1 gets the same tf-idf score for "set" and for "theory", even though "set" is intuitively more characteristic of d1 than "theory" is.
Using the average term frequency, these two cases could be distinguished (see the sketch below). Is tf-iatf (inverse average term frequency) a valid approach? To me it seems it would surface genuinely important keywords rather than merely "rare" or "unique" ones. If idf is an estimate of "how rare this word is", wouldn't iatf be a better estimate? It seems only slightly harder to implement (especially if the data is already preprocessed).
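For concreteness, here is a minimal sketch of both measures on the toy corpus above. The tf-iatf definition used here (tf divided by the term's average frequency over the documents containing it) is my assumption of what "inverse average term frequency" would mean, and the corpus size N = 10 is arbitrary:

```python
import math

# Toy corpus from the example: per-document counts for each term.
# Documents d4..dN contain neither word; assume N = 10 total documents.
N = 10
counts = {
    "set":    {"d1": 100, "d2": 1,   "d3": 1},
    "theory": {"d1": 100, "d2": 100, "d3": 100},
}

def tf_idf(term, doc):
    tf = counts[term].get(doc, 0)
    df = len(counts[term])        # number of documents containing the term
    return tf * math.log(N / df)

def tf_iatf(term, doc):
    tf = counts[term].get(doc, 0)
    docs = counts[term]
    # Assumed definition: average frequency of the term over the
    # documents that actually contain it.
    atf = sum(docs.values()) / len(docs)
    return tf / atf

for term in ("set", "theory"):
    print(term,
          "tf-idf:", round(tf_idf(term, "d1"), 3),
          "tf-iatf:", round(tf_iatf(term, "d1"), 3))
```

Under this sketch, tf-idf gives d1 the identical score 100 * log(10/3) for both words, while tf-iatf gives "set" 100/34 ≈ 2.94 and "theory" 100/100 = 1.0, so the two cases are separated exactly as I hoped.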
I'm going to run an experiment and manually inspect the top-ranked keywords under each measure, but first I wanted to get a second pair of eyes on the idea.
Follow-up question: why is tf-idf used so widely rather than alternative measures like this one, which MAY be more accurate (assuming this is a valid approach)?