Term frequency (TF) is the square root of the number of times a term appears in a particular document.
Inverse document frequency (IDF) is the log of (the total number of documents divided by the number of documents containing the term), plus one. If the term appears in zero documents, obviously do not try to divide by zero.
In case it is not clear from the above: there is one TF per term per document, and one IDF per term.
And then TF-IDF(term, document) = TF(term, document) * IDF(term)
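As a rough illustration, the definitions above could be sketched in Java like this. The class name `TfIdfSketch`, its method names, and the choice to return an IDF of 0 for a term that appears in no document are assumptions made for this example, not part of the original answer. Each document is assumed to be already reduced to a map from term to raw count.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names are illustrative only.
final class TfIdfSketch {

    // TF: square root of the raw count of the term in one document.
    static double tf(int countInDocument) {
        return Math.sqrt(countInDocument);
    }

    // IDF: log(totalDocs / docsContainingTerm) + 1.
    // Assumption: a term found in no document gets IDF 0, which avoids dividing by zero.
    static double idf(int totalDocs, int docsContainingTerm) {
        if (docsContainingTerm == 0) {
            return 0.0;
        }
        return Math.log((double) totalDocs / docsContainingTerm) + 1.0;
    }

    // Builds the TF-IDF vector of one document; corpus holds one term->count map per document.
    static Map<String, Double> tfIdfVector(Map<String, Integer> doc,
                                           List<Map<String, Integer>> corpus) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> entry : doc.entrySet()) {
            String term = entry.getKey();
            int docsContainingTerm = 0;
            for (Map<String, Integer> other : corpus) {
                if (other.containsKey(term)) {
                    docsContainingTerm++;
                }
            }
            vector.put(term, tf(entry.getValue()) * idf(corpus.size(), docsContainingTerm));
        }
        return vector;
    }
}
```

Recomputing the document frequency for every term is quadratic in the worst case. For a large corpus it would probably be better to compute all document frequencies once up front.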
Finally, you use the vector space model to compare documents, where each term is a new dimension, and the "length" of the part of the vector that points in that dimension is the TF-IDF value. Each document is a vector, so compute the two vectors, and then calculate the distance between them.
To do this in Java, read each document with a FileReader.
The distance between the two vectors is:
D = sqrt((x2-x1)^2 + (y2-y1)^2 + ... + (n2-n1)^2)
where x1 is the TF-IDF of term x in document 1.
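The distance formula could be sketched in Java roughly as follows, assuming each document vector is stored sparsely as a map from term to TF-IDF value. A term missing from one map is treated as 0 in that dimension. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; names are illustrative only.
final class VectorDistanceSketch {

    // Euclidean distance between two sparse vectors keyed by term.
    static double euclideanDistance(Map<String, Double> a, Map<String, Double> b) {
        // Every term that occurs in either document is one dimension.
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());

        double sumOfSquares = 0.0;
        for (String term : terms) {
            double diff = b.getOrDefault(term, 0.0) - a.getOrDefault(term, 0.0);
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```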
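To read a file and count its terms: create new BufferedReader(new FileReader(filename)), call BufferedReader.readLine() in a while loop until it returns null, split each line with line.split("\\s"), and add 1 to the count of each term, keeping the counts in a HashMap.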
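Reading a file line by line and counting terms in a HashMap might look like the sketch below. The class and method names are hypothetical, and it splits on `\\s+` rather than `\\s` so that runs of whitespace do not produce empty tokens, which is an assumption about the desired behaviour. Taking a `Reader` lets the same code run on a file or on an in-memory string.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; names are illustrative only.
final class TermCounterSketch {

    // Counts whitespace-separated terms from any character source.
    static Map<String, Integer> countTerms(Reader source) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() returns null at end of input.
            while ((line = reader.readLine()) != null) {
                for (String token : line.split("\\s+")) {
                    if (!token.isEmpty()) {
                        counts.merge(token, 1, Integer::sum); // add 1 to this term's count
                    }
                }
            }
        }
        return counts;
    }

    // Convenience wrapper for a file on disk.
    static Map<String, Integer> countTermsInFile(String filename) throws IOException {
        return countTerms(new FileReader(filename));
    }
}
```

This does no lowercasing or punctuation stripping, so "Term" and "term," count as different terms. Whether to normalize tokens depends on the documents being compared.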
, D , X, X - . , X ^ 2 - 10 000. , , D . , Ds - , . ?