Identifying important words and phrases in text

I have text stored in a Python string.

What I want

  • To identify the keywords in this text.
  • To identify N-grams in this text (ideally more than just bigrams and trigrams).

Keep in mind...

  • The text may be small (e.g. the size of a tweet)
  • The text may be medium (e.g. the size of a news article)
  • The text may be large (e.g. the size of a book or chapter)

What I have

I already use nltk to split the corpus into tokens and remove stop words:

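    import nltk

    # split across any non-word character (apostrophes are kept inside tokens)
    tokenizer = nltk.tokenize.RegexpTokenizer(r"[^\w']+", gaps=True)

    # tokenize
    tokens = tokenizer.tokenize(text)

    # remove stopwords (build the set once instead of reloading the list per token)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]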

I know about BigramCollocationFinder and TrigramCollocationFinder, which do exactly what I am looking for in those two cases.


I need advice on higher-order n-grams, on improving the results from BCF and TCF, and on a better way to identify the most unique keywords.

Many thanks!

1 answer

As for the best way to identify the most unique keywords, tf-idf is a common measure. To compute it efficiently and accurately, you could integrate a search engine, or build a simple custom inverted index that is updated dynamically and stores each term's frequency and document frequency.
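A minimal sketch of the custom-index idea, assuming a small in-memory collection where each document is added exactly once. The class name `TinyIndex` and its methods are hypothetical, and the weighting used here (relative term frequency times `log(N / df)`) is only one of several common tf-idf variants:

```python
import math
from collections import Counter, defaultdict


class TinyIndex:
    """Hypothetical in-memory index: term counts per document plus document frequencies."""

    def __init__(self):
        self.term_counts = {}             # doc_id -> Counter of term frequencies
        self.doc_freq = defaultdict(int)  # term -> number of documents containing it

    def add(self, doc_id, tokens):
        # Assumes each doc_id is added only once; re-adding would inflate doc_freq.
        counts = Counter(tokens)
        self.term_counts[doc_id] = counts
        for term in counts:
            self.doc_freq[term] += 1

    def tfidf(self, doc_id):
        # tf = count / document length, idf = log(N / df)
        counts = self.term_counts[doc_id]
        total = sum(counts.values())
        n_docs = len(self.term_counts)
        return {
            term: (count / total) * math.log(n_docs / self.doc_freq[term])
            for term, count in counts.items()
        }

    def top_keywords(self, doc_id, k=5):
        scores = self.tfidf(doc_id)
        return sorted(scores, key=scores.get, reverse=True)[:k]


index = TinyIndex()
index.add("a", ["cat", "sat", "mat"])
index.add("b", ["cat", "ate", "fish"])
index.add("c", ["dog", "sat", "log"])

best = index.top_keywords("a", k=1)  # "mat" occurs in no other document
```

With this weighting, a term that appears in every document scores zero, so it only separates keywords when there are other documents to compare against. For a single tweet-sized text you would need some background collection to index alongside it.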

As for your N-grams, why not write your own parser using a "window" approach (a window of length N) that identifies, say, the most frequent ones? Just keep each N-gram as a key in a dictionary whose value is either its frequency or a score (based on the tf-idf of its individual terms).
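The window approach can be sketched roughly as follows. The function names `ngrams` and `count_ngrams` are made up for illustration, and the value stored per N-gram is a plain frequency count rather than a tf-idf-based score:

```python
from collections import Counter


def ngrams(tokens, n):
    """Slide a window of length n over the tokens, yielding each window as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(tokens, min_n=2, max_n=5):
    """Count the N-grams for every n from min_n to max_n; returns {n: Counter}."""
    return {n: Counter(ngrams(tokens, n)) for n in range(min_n, max_n + 1)}


tokens = "new york city is big and new york city is loud".split()
counts = count_ngrams(tokens, min_n=2, max_n=4)

top_4gram, freq = counts[4].most_common(1)[0]
# top_4gram == ("new", "york", "city", "is"), freq == 2
```

Note that if you run this on tokens with the stop words already removed, a window can join words that were not adjacent in the original text, so you may prefer to count over the unfiltered tokens and filter the resulting N-grams afterwards.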


