I am trying to build a glossary of terms using tf-idf. However, intuitively this does not seem to make sense.
If the inverse document frequency (idf) part of tf-idf weights a term by how rare it is across the entire corpus, then some important words may end up with a low score.
If we look at a corpus of legal documents, a term such as “License” or “Jurisprudence” may appear in every document. Because of idf, the score for these terms will be very low. However, intuitively, these terms should have a higher score, since they are clearly legal terms.
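To make the concern concrete, here is a minimal sketch using the textbook definition idf = log(N / df), with a made-up three-document corpus (the documents and the helper names `idf` and `tf_idf` are my own assumptions for illustration). Library implementations often use smoothed variants, so the score there may be small rather than exactly zero.

```python
import math
from collections import Counter

# Hypothetical toy corpus: "license" appears in every document,
# "patent" appears in only one.
docs = [
    "the license grants rights under this jurisdiction".split(),
    "the license may be revoked by the court".split(),
    "the license covers the patent and the trademark".split(),
]

def idf(term, docs):
    """Unsmoothed idf: log(N / df). Assumes the term occurs in at least one document."""
    n_docs = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / df)

def tf_idf(term, doc, docs):
    """Relative term frequency in one document, multiplied by idf."""
    tf = Counter(doc)[term] / len(doc)
    return tf * idf(term, docs)

license_idf = idf("license", docs)   # df == N, so log(1) == 0
patent_idf = idf("patent", docs)     # df == 1, so log(3) > 0

license_score = tf_idf("license", docs[2], docs)
patent_score = tf_idf("patent", docs[2], docs)

print(license_idf, patent_idf)
print(license_score, patent_score)
```

With this definition, "license" gets a score of zero in every document, no matter how often it occurs, which is exactly the behaviour I find counterintuitive for a legal glossary.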
Is tf-idf a bad approach to building a glossary of terms?