Tf-idf and previously unseen terms

Question

TF-IDF (term frequency-inverse document frequency) is a staple of information retrieval. It is not a proper model, though, and it seems to break down when new terms are introduced into the corpus. How do people handle it when queries or new documents contain new terms, especially high-frequency ones? Under traditional cosine matching, those terms would have no impact on the overall match.

+3

algorithm statistics nlp tf-idf

Gregg Lind Oct 21 '08 at 18:53

2 answers

Jay Kominek · Answer 1 · 2008-10-31T22:11:20+0000

Er, no, it doesn't break down.

Let's say I have two documents: A, "affectionate goat", and B, "cheese gopher". If we actually represented them as vectors, they could look something like this:

A [1,1,0,0]
B [0,0,1,1]
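
As a rough sketch of where those two vectors come from, the snippet below builds a vocabulary from the two documents and counts terms per document. The names `docs`, `vocab` and `to_vector` are made up for illustration, and raw counts stand in for tf-idf weights to keep it short.

```python
# The two documents come from the answer above; all names here are hypothetical.
docs = {"A": "affectionate goat", "B": "cheese gopher"}

# Vocabulary in first-seen order, so each term gets a fixed vector position.
vocab = []
for text in docs.values():
    for term in text.split():
        if term not in vocab:
            vocab.append(term)


def to_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    terms = text.split()
    return [terms.count(term) for term in vocab]


vectors = {name: to_vector(text, vocab) for name, text in docs.items()}
print(vocab)
print(vectors)
```

Each position in a vector corresponds to one vocabulary term, which is why a term outside the vocabulary has no position to land in.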

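If I search the corpus (A, B) with the query "marmoset kungfu", neither term exists in the corpus, so the query vector is orthogonal to every document vector and the cosine similarity is zero. Given that none of the terms match, that seems pretty reasonable.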
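
A minimal sketch of the "marmoset kungfu" query against the corpus (A, B), using plain term counts instead of tf-idf weights. The helper names are hypothetical, and returning 0.0 for a zero-length vector is a convention assumed here, since the cosine is strictly undefined in that case.

```python
import math

# Vocabulary and document vectors for A "affectionate goat" and B "cheese gopher".
vocab = ["affectionate", "goat", "cheese", "gopher"]
corpus = {"A": [1, 1, 0, 0], "B": [0, 0, 1, 1]}


def to_vector(text, vocab):
    """Count vocabulary terms; terms outside the vocabulary are dropped."""
    terms = text.split()
    return [terms.count(term) for term in vocab]


def cosine(u, v):
    """Cosine similarity, with 0.0 assumed when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Query made only of unseen terms: it matches nothing.
query = to_vector("marmoset kungfu", vocab)
scores = {name: cosine(query, vec) for name, vec in corpus.items()}
print(scores)

# Query mixing a known and an unseen term: only "goat" contributes.
mixed = to_vector("goat kungfu", vocab)
print(cosine(mixed, corpus["A"]), cosine(mixed, corpus["B"]))
```

The second query shows the effect the question describes: "kungfu" has no dimension in the vocabulary, so it neither raises nor lowers the score against A.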

Trochee · Answer 2 · 2008-12-31T22:15:25+0000

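One way to handle this is to fix the vocabulary and treat every word rarer than some threshold as a special word, _UNKNOWN_.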
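
One way the _UNKNOWN_ idea could be sketched is below. The threshold `MIN_COUNT`, the sample documents and the helper names are all assumptions made for illustration. Words seen fewer than `MIN_COUNT` times at build time, and any word never seen at all, share a single _UNKNOWN_ dimension.

```python
from collections import Counter

UNKNOWN = "_UNKNOWN_"
MIN_COUNT = 2  # assumed threshold: rarer words are folded into UNKNOWN


def build_vocab(docs, min_count=MIN_COUNT):
    """Keep terms seen at least min_count times, plus the UNKNOWN slot."""
    counts = Counter(term for text in docs for term in text.split())
    kept = sorted(term for term, n in counts.items() if n >= min_count)
    return kept + [UNKNOWN]


def to_vector(text, vocab):
    """Count terms, mapping anything outside the vocabulary to UNKNOWN."""
    known = set(vocab)
    mapped = [term if term in known else UNKNOWN for term in text.split()]
    counts = Counter(mapped)
    return [counts[term] for term in vocab]


# Made-up sample documents, not taken from the thread.
docs = ["goat cheese goat", "cheese gopher", "goat gopher marmoset"]
vocab = build_vocab(docs)
print(vocab)
print(to_vector("marmoset kungfu goat", vocab))
```

Because rare training words also land in the _UNKNOWN_ slot, that dimension is non-zero in some indexed documents, so a query with unseen terms is no longer invisible to the cosine score. Whether lumping all unknown words together is acceptable depends on the application.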
