How to extract only the keywords for a domain?

I am trying to identify the most popular keywords for a particular class of documents in my collection. Assuming the domain is "computer science" (which, of course, includes networking, computer architecture, etc.), what is the best way to extract these domain-related keywords from text? I tried to use WordNet, but I do not quite understand how best to use it to extract this information.

Is there a known list of words that I could use as a whitelist, given that I do not know all the keywords for the domain in advance? Or are there good NLP / machine-learning methods for identifying domain keywords?

+4
2 answers

You need a large collection of documents, with a subset of it (still a large set of documents) representing your domain. Use NLTK to compute word statistics, normalize the words morphologically (e.g. by stemming), and filter out stop words. A good statistic is TF * IDF, which is roughly the number of occurrences of a word in the domain subset divided by (the log of) the fraction of documents in the whole collection that contain the word. The keywords are the words with the highest TF * IDF scores.
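The scoring described above can be sketched with the standard library alone. This is a minimal illustration, not the answerer's actual code: the tokenizer and the tiny `STOP_WORDS` set are simplified placeholders (NLTK provides real stop-word lists and stemmers), and it assumes the domain documents are a subset of the whole collection.

```python
import math
import re
from collections import Counter

# Placeholder stop-word list; in practice use nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"}

def tokenize(text):
    # Lowercase, split on non-letters, and drop stop words.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def domain_keywords(domain_docs, all_docs, top_k=10):
    """Rank words by TF (over the domain subset) * IDF (over the whole collection).

    Assumes domain_docs is a subset of all_docs, so every domain word has
    a nonzero document frequency.
    """
    n_docs = len(all_docs)
    # Document frequency: in how many documents of the whole collection each word occurs.
    df = Counter()
    for doc in all_docs:
        df.update(set(tokenize(doc)))
    # Term frequency: total occurrences within the domain subset only.
    tf = Counter()
    for doc in domain_docs:
        tf.update(tokenize(doc))
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For example, with two networking/architecture documents as the domain subset inside a four-document collection, a word like "network" that is frequent in the subset but rare elsewhere rises to the top, while collection-wide common words score near zero.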

+5

I have used parsimonious language models (PLMs; see [1], [3]) with some success on similar tasks: they separate terms specific to individual documents from general-purpose terms. They are reported to outperform plain tf-idf statistics, but they do require a free parameter to be set when fitting them.

You can find my Python implementation here; to use it, concatenate all the documents for each topic into one document, then fit a ParsimoniousLM on the per-topic documents and request the .top(K) terms for each one.

+3

Source: https://habr.com/ru/post/1379379/
