Stemming unstructured text in NLTK

I tried a regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with:

 import nltk
 import sys, re
 from nltk.book import *
 from nltk import stem

 f = open('tupac_original.txt', 'rU')
 text = f.read()
 text1 = text.split()
 tup = nltk.Text(text1)
 lowtup = [w.lower() for w in tup if w.isalpha()]
 tupclean = [w for w in lowtup if w not in nltk.corpus.stopwords.words('english')]
 tupstem = stem.RegexpStemmer('az$|as$|a$')
 [tupstem.stem(i) for i in tupclean]

The result of this is:

 ['like', 'ed', 'young', 'black', 'like'...] 

I am trying to clean up .txt files (lowercase everything, remove stop words, etc.), normalize several spellings of a word into one, and then run a frequency distribution/count. I know how to make a FreqDist , but any suggestions as to where I am going wrong with the stemming?
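For reference, the counting step I have in mind looks roughly like this (a minimal sketch; stemmed and freq are my own names, and it assumes the stemmed output from the last line above is saved first):

 >>> stemmed = [tupstem.stem(i) for i in tupclean]
 >>> freq = nltk.FreqDist(stemmed)
 >>> freq.most_common(10)  # NLTK 3; on NLTK 2, freq.items() is already sorted by count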

1 answer

NLTK has several pre-coded, well-known stemmers; see http://nltk.org/api/nltk.stem.html , and an example is given below.

 >>> from nltk import stem
 >>> porter = stem.porter.PorterStemmer()
 >>> lancaster = stem.lancaster.LancasterStemmer()
 >>> snowball = stem.snowball.EnglishStemmer()
 >>> tokens = ['player', 'playa', 'playas', 'pleyaz']
 >>> [porter.stem(i) for i in tokens]
 ['player', 'playa', 'playa', 'pleyaz']
 >>> [lancaster.stem(i) for i in tokens]
 ['play', 'play', 'playa', 'pleyaz']
 >>> [snowball.stem(i) for i in tokens]
 [u'player', u'playa', u'playa', u'pleyaz']

But what you probably need is some sort of regex stemmer:

 >>> from nltk import stem
 >>> rxstem = stem.RegexpStemmer('er$|a$|as$|az$')
 >>> [rxstem.stem(i) for i in tokens]
 ['play', 'play', 'play', 'pley']
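Plugged back into your pipeline, that gives the normalized count you were after. A sketch, assuming tupclean is the cleaned token list from your question (stemmed and fd are placeholder names):

 >>> import nltk
 >>> stemmed = [rxstem.stem(w) for w in tupclean]
 >>> fd = nltk.FreqDist(stemmed)
 >>> fd['play']  # 'player', 'playa' and 'playas' now all count under the one stem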
