I have text stored in a Python string.
What I want
- To identify the keywords in this text.
- To identify n-grams in this text (ideally more than just bigrams and trigrams).
Keep in mind...
- The text may be small (e.g., tweet-sized)
- The text may be medium (e.g., the size of a news article)
- The text may be large (e.g., the size of a book or a chapter)
What I have
I already use NLTK to tokenize the text and remove stop words:
    import nltk

    # split across any non-word character, keeping apostrophes inside words
    tokenizer = nltk.tokenize.RegexpTokenizer(r"[^\w']+", gaps=True)

    # tokenize
    tokens = tokenizer.tokenize(text)

    # remove stopwords
    tokens = [w for w in tokens if w not in nltk.corpus.stopwords.words('english')]
I know about BigramCollocationFinder and TrigramCollocationFinder, which do what I am looking for in those two cases.
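For reference, this is roughly how I use the bigram finder today (a minimal sketch; the PMI scorer and the frequency filter are just my current choices, not requirements):

    from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(2)                    # ignore pairs seen only once
    top_bigrams = finder.nbest(bigram_measures.pmi, 20)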
Question
I am looking for advice on handling higher-order n-grams, on improving the results from BCF and TCF, and on better ways to identify the most distinctive keywords.
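For what it's worth, I can already enumerate raw higher-order n-grams with nltk.util.ngrams, but that just produces every window and gives no sense of which ones are actually meaningful:

    from nltk.util import ngrams

    # naive enumeration: every 4-token window, no scoring of which are interesting
    fourgrams = list(ngrams(tokens, 4))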
Many thanks!