TF-IDF and previously unseen terms

TF-IDF (Term Frequency - Inverse Document Frequency) is a staple of information retrieval. It is not a proper model, though, and it seems to break down when new terms are introduced into the corpus. How do people handle it when queries or new documents contain new terms, especially high-frequency ones? With traditional cosine matching, those terms would have no effect on the overall match.

+3
2 answers

Er, no, it doesn't break.

Let's say I have two documents: A, “affectionate goat”, and B, “cheese gopher”. If we actually represented them as vectors, they might look something like this:

A [1,1,0,0]
B [0,0,1,1]

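If I search the corpus (A, B) with the query "marmoset kungfu", neither term occurs in any document, so the query shares no dimension with A or B and its cosine similarity with both is zero.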
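The effect of an unseen query can be checked with a minimal sketch. It uses raw term counts rather than full TF-IDF weights to keep the arithmetic visible, and the helper names (`build_vocabulary`, `to_vector`, `cosine`) are hypothetical, chosen only for this illustration.

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Assign each distinct term an index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for term in doc.split():
            vocab.setdefault(term, len(vocab))
    return vocab

def to_vector(text, vocab):
    """Term-count vector over the known vocabulary; unseen terms are dropped."""
    counts = Counter(text.split())
    return [counts.get(term, 0) for term in vocab]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = {"A": "affectionate goat", "B": "cheese gopher"}
vocab = build_vocabulary(docs.values())

vec_a = to_vector(docs["A"], vocab)  # [1, 1, 0, 0]
vec_b = to_vector(docs["B"], vocab)  # [0, 0, 1, 1]

# Every query term is unseen, so the query vector has no non-zero entries.
query_vec = to_vector("marmoset kungfu", vocab)
print(cosine(query_vec, vec_a), cosine(query_vec, vec_b))

# A query mixing a known and an unseen term still matches on the known one.
mixed_vec = to_vector("goat kungfu", vocab)
print(cosine(mixed_vec, vec_a))
```

Nothing breaks in either case: the unseen terms simply contribute nothing, and a query made up only of unseen terms scores zero against every document.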

+2

Unseen terms can also be mapped to a single placeholder token such as _UNKNOWN_.
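One way the `_UNKNOWN_` idea might look in practice is sketched below. Reserving index 0 for the placeholder and the helper names are assumptions made for this illustration, not something the answer specifies, and raw counts stand in for TF-IDF weights.

```python
UNKNOWN = "_UNKNOWN_"

def build_vocabulary(docs):
    """Index the known terms, reserving slot 0 for the placeholder token."""
    vocab = {UNKNOWN: 0}
    for doc in docs:
        for term in doc.split():
            vocab.setdefault(term, len(vocab))
    return vocab

def to_vector(text, vocab):
    """Count terms, routing anything outside the vocabulary to the placeholder slot."""
    vec = [0] * len(vocab)
    for term in text.split():
        vec[vocab.get(term, vocab[UNKNOWN])] += 1
    return vec

vocab = build_vocabulary(["affectionate goat", "cheese gopher"])

# "marmoset" and "kungfu" are unseen, so both land in the _UNKNOWN_ slot.
query_vec = to_vector("marmoset goat kungfu", vocab)
print(query_vec)
```

Because the indexed documents have a zero in the placeholder slot, that slot never adds to the dot product. It only increases the query's norm, so a query padded with unseen terms scores lower than one without them.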

+1

Source: https://habr.com/ru/post/1698532/

