Stemming unstructured text in NLTK

I tried a regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with:

 import nltk
 import sys, re
 from nltk.book import *
 from nltk import stem

 f = open('tupac_original.txt', 'rU')
 text = f.read()
 text1 = text.split()
 tup = nltk.Text(text1)
 lowtup = [w.lower() for w in tup if w.isalpha()]
 tupclean = [w for w in lowtup if w not in nltk.corpus.stopwords.words('english')]
 tupstem = stem.RegexpStemmer('az$|as$|a$')
 [tupstem.stem(i) for i in tupclean]

The result of this is:

 ['like', 'ed', 'young', 'black', 'like'...] 

I am trying to clean up .txt files (lowercase everything, remove stop words, etc.), normalize several spellings of a word into one, and then run a frequency distribution/count. I know how to make a FreqDist , but any suggestions as to where I am going wrong with the stemming?
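For reference, the counting step I have in mind looks roughly like this (a minimal sketch; stemmed and freq are my own names, and it assumes the stemmed output from the last line above is saved first):

 >>> stemmed = [tupstem.stem(i) for i in tupclean]
 >>> freq = nltk.FreqDist(stemmed)
 >>> freq.most_common(10)  # NLTK 3; on NLTK 2, freq.items() is already sorted by count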

1 answer

NLTK has several pre-coded, well-known stemmers; see http://nltk.org/api/nltk.stem.html , and an example is given below.

 >>> from nltk import stem
 >>> porter = stem.porter.PorterStemmer()
 >>> lancaster = stem.lancaster.LancasterStemmer()
 >>> snowball = stem.snowball.EnglishStemmer()
 >>> tokens = ['player', 'playa', 'playas', 'pleyaz']
 >>> [porter.stem(i) for i in tokens]
 ['player', 'playa', 'playa', 'pleyaz']
 >>> [lancaster.stem(i) for i in tokens]
 ['play', 'play', 'playa', 'pleyaz']
 >>> [snowball.stem(i) for i in tokens]
 [u'player', u'playa', u'playa', u'pleyaz']

But what you probably need is some sort of regex stemmer:

 >>> from nltk import stem
 >>> rxstem = stem.RegexpStemmer('er$|a$|as$|az$')
 >>> [rxstem.stem(i) for i in tokens]
 ['play', 'play', 'play', 'pley']
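Plugged back into your pipeline, that gives the normalized count you were after. A sketch, assuming tupclean is the cleaned token list from your question (stemmed and fd are placeholder names):

 >>> import nltk
 >>> stemmed = [rxstem.stem(w) for w in tupclean]
 >>> fd = nltk.FreqDist(stemmed)
 >>> fd['play']  # 'player', 'playa' and 'playas' now all count under the one stem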
