Identifying important words and phrases in text

I have text stored in a Python string.

What I want

  • To identify the keywords in this text.
  • To identify the N-grams in this text (ideally more than just bigrams and trigrams).

Keep in mind...

  • The text may be small (e.g. tweet-sized)
  • The text may be medium-sized (e.g. a news article)
  • The text may be large (e.g. a book or a chapter)

What I have

I already use nltk to break the text into tokens and remove stop words:

import nltk

# split across any non-word character (apostrophes kept inside words)
tokenizer = nltk.tokenize.RegexpTokenizer(r"[^\w']+", gaps=True)

# tokenize
tokens = tokenizer.tokenize(text)

# remove stopwords (a set lookup is much faster than scanning the list each time)
stopwords = set(nltk.corpus.stopwords.words('english'))
tokens = [w for w in tokens if w not in stopwords]

I know about BigramCollocationFinder and TrigramCollocationFinder, which do what I am looking for in those two cases.
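For reference, the bigram case looks something like this (a sketch using NLTK's collocations API; the trigram classes work the same way):

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore bigrams seen fewer than 2 times
print(finder.nbest(bigram_measures.pmi, 10))  # top 10 by PMI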

Question

I need advice on higher-order n-grams, on improving the results from the bigram and trigram collocation finders, and on better ways to identify the most distinctive keywords.

Many thanks!

1 answer

As for the best way to identify the most distinctive keywords, tf-idf is a common measure. That means either integrating a search engine or building a simple custom inverted index, one that is dynamic and keeps term frequencies and document frequencies, so that tf-idf can be computed efficiently and accurately.
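A minimal sketch of the idea (the docs corpus and the helper name are hypothetical; tf-idf only makes sense against a background collection of non-empty documents):

import math
from collections import Counter

def tfidf_scores(docs):
    # docs: list of token lists, one list per document
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return scores

The highest-scoring terms in a document are its most distinctive keywords: frequent locally, rare across the collection.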

As for your N-grams, why not build your own scanner using a sliding "window" approach (where the window has length N) and identify, say, the most frequent ones? Just keep each N-gram as a key in a dictionary, with a value that is either its frequency or a score (based on the tf-idf of its individual terms).
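In code, the window approach is just a loop over token offsets (a sketch; tokens is assumed to be the stopword-filtered list from the question):

from collections import Counter

def ngram_counts(tokens, n):
    # slide a window of length n over the tokens and count each N-gram
    windows = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(windows)

# e.g. the ten most common 4-grams:
# print(ngram_counts(tokens, 4).most_common(10))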


Source: https://habr.com/ru/post/898779/
