You represent each document as a vector in which every index position holds a “weight” for one term. For example, take the document “hello world”: if we associate position 0 with the importance of “hello” and position 1 with the importance of “world”, and we measure importance as the number of times the term appears, the document is represented as d = (1, 1).
In the same way, a document that says only “hello” would be (1, 0).
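As a rough illustration of the counting scheme above, here is a minimal Python sketch. The function name `count_vector`, the fixed two-term vocabulary, and the whitespace tokenization are assumptions made for this example only, not part of any particular library.

```python
from collections import Counter

def count_vector(document, vocabulary):
    """Return one term count per vocabulary entry, in vocabulary order."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    counts = Counter(document.lower().split())
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] for term in vocabulary]

# Position 0 is "hello", position 1 is "world".
vocabulary = ["hello", "world"]

print(count_vector("hello world", vocabulary))  # [1, 1]
print(count_vector("hello", vocabulary))        # [1, 0]
```

In practice the vocabulary would be collected from the whole collection rather than written by hand.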
This representation can be built on any measure of the importance of terms in documents, with term frequency (as suggested by @Pedrom) being the simplest option. The most common technique, which is still fairly simple, is TF-IDF, which combines how common a term is in the document with how rare it is in the collection.
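To make the TF-IDF idea a bit more concrete, here is a small sketch of one common variant: term frequency as the count divided by the document length, and inverse document frequency as log(N / df). Other weightings and smoothing schemes exist, so treat the formula, the helper name `tf_idf_vectors`, and the whitespace tokenization as assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) using tf = count / length, idf = log(N / df)."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / length) * math.log(n_docs / df[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

vocabulary, vectors = tf_idf_vectors(["hello world", "hello"])
for vector in vectors:
    print(dict(zip(vocabulary, vector)))
```

With this variant, “hello” appears in every document, so its weight drops to zero, while “world” keeps a positive weight in the first document because it is rare in the collection.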
I hope this helps.