Intuition behind tf-idf for term extraction

I am trying to build a dictionary (glossary) of terms using tf-idf. However, intuitively, the approach does not seem to make sense.

The inverse document frequency (IDF) part of tf-idf measures how rare a term is across the entire corpus, which means that some important words may end up with a low score.

Consider a corpus of legal documents: a term such as "License" or "Jurisprudence" may appear in every document. Because of IDF, the score for these terms will be very low. Intuitively, however, these terms should score higher, since they are clearly legal terms.

Is tf-idf a bad approach to building a glossary of terms?

1 answer

Yes, these terms are legal terms. However, TF-IDF does not try to judge whether they are relevant to a particular domain. It helps you discriminate between documents within that domain. If a term such as "legal" appears in every document, it will not help a classifier separate those documents. If, however, you mix your legal documents with a random set of other documents, you will find that these terms suddenly become highly relevant, precisely because they let you separate legal documents from the rest.
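To make this concrete, here is a minimal sketch of the effect, assuming the plain IDF definition log(N / df) with whitespace tokenization. The helper `idf` and both toy corpora are made up for illustration; real libraries usually apply smoothing, so their numbers will differ.

```python
import math

def idf(term, corpus):
    """Plain IDF: log(N / df). Returns 0.0 for an unseen term (a simplification)."""
    df = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / df) if df else 0.0

# Hypothetical toy corpora, invented for this example.
legal_docs = [
    "the license grants the licensee limited rights",
    "this license is governed by the court",
    "the license may be terminated by either party",
]
other_docs = [
    "the cat sat on the mat",
    "the recipe needs two eggs and flour",
    "the team won the match last night",
]

# "license" occurs in every legal document, so it carries no weight there.
idf_legal_only = idf("license", legal_docs)

# Mixed with unrelated documents, the same term becomes discriminative.
idf_mixed = idf("license", legal_docs + other_docs)

print(idf_legal_only)  # log(3/3) = 0.0
print(idf_mixed)       # log(6/3) = log 2, roughly 0.693
```

The term itself did not change between the two runs; only the corpus it is compared against did.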

In practice, it is more often used to eliminate words like "the". For example, "the" occurs in every document and does not make any difference.
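A rough sketch of that filtering effect, again assuming unsmoothed IDF (log(N / df)) and raw term counts as TF. The three documents and the variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for this example.
docs = [
    "the license grants the licensee limited rights",
    "the cat sat on the mat",
    "the recipe needs two eggs and flour",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log(n_docs / count) for term, count in df.items()}

# tf-idf weights for the first document, using raw counts as TF.
tf = Counter(tokenized[0])
tfidf = {term: tf[term] * idf[term] for term in tf}

# Terms with zero IDF appear in every document and can be dropped.
kept = sorted(term for term, weight in idf.items() if weight > 0)

print(tfidf["the"])      # 0.0, even though "the" occurs twice in the document
print(tfidf["license"])  # log 3, roughly 1.099
```

Although "the" is the most frequent token in the first document, its weight is zero because it appears in all three documents.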

Whether TF-IDF is a good way to build a dictionary depends heavily on what you want to do with that dictionary afterwards.


Source: https://habr.com/ru/post/1243254/

