How to extract only the keywords for a domain?

I am trying to identify the most popular keywords for a particular class of documents in my collection. Assuming the domain is "computer science" (which, of course, includes networking, computer architecture, etc.), what is the best way to extract these domain-related keywords from text? I tried to use WordNet, but I do not quite understand how best to use it to extract this information.

Is there a known list of words that I could use as a whitelist, given that I do not know all the keywords for the domain in advance? Or are there good NLP / machine-learning methods for identifying domain keywords?

+4
2 answers

You need a large collection of documents, with a subset of it (still a large set of documents) representing your domain. Use NLTK to compute word statistics, normalize the words morphologically (e.g. by stemming), and filter out stop words. A good statistic is TF * IDF, which is roughly the number of occurrences of a word in the domain subset divided by (the log of) the fraction of documents in the whole collection that contain the word. The keywords are the words with the highest TF * IDF scores.
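The scoring described above can be sketched with the standard library alone. This is a minimal illustration, not the answerer's actual code: the tokenizer and the tiny `STOP_WORDS` set are simplified placeholders (NLTK provides real stop-word lists and stemmers), and it assumes the domain documents are a subset of the whole collection.

```python
import math
import re
from collections import Counter

# Placeholder stop-word list; in practice use nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"}

def tokenize(text):
    # Lowercase, split on non-letters, and drop stop words.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def domain_keywords(domain_docs, all_docs, top_k=10):
    """Rank words by TF (over the domain subset) * IDF (over the whole collection).

    Assumes domain_docs is a subset of all_docs, so every domain word has
    a nonzero document frequency.
    """
    n_docs = len(all_docs)
    # Document frequency: in how many documents of the whole collection each word occurs.
    df = Counter()
    for doc in all_docs:
        df.update(set(tokenize(doc)))
    # Term frequency: total occurrences within the domain subset only.
    tf = Counter()
    for doc in domain_docs:
        tf.update(tokenize(doc))
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For example, with two networking/architecture documents as the domain subset inside a four-document collection, a word like "network" that is frequent in the subset but rare elsewhere rises to the top, while collection-wide common words score near zero.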

+5

I have used parsimonious language models (PLMs; see [1], [3]) with some success on similar tasks: they separate terms specific to individual documents from general-purpose terms. They are reported to outperform plain tf-idf statistics, but they do require a free parameter to be set when fitting them.

You can find my Python implementation here; to use it, concatenate all the documents for each topic into one document, then fit a ParsimoniousLM on the per-topic documents and request the .top(K) terms for each one.

+3

Source: https://habr.com/ru/post/1379379/
