NLTK ships with several well-known stemmers already implemented; see http://nltk.org/api/nltk.stem.html. An example is given below.
>>> from nltk import stem
>>> porter = stem.porter.PorterStemmer()
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> tokens = ['player', 'playa', 'playas', 'pleyaz']
>>> [porter.stem(i) for i in tokens]
['player', 'playa', 'playa', 'pleyaz']
>>> [lancaster.stem(i) for i in tokens]
['play', 'play', 'playa', 'pleyaz']
>>> [snowball.stem(i) for i in tokens]
[u'player', u'playa', u'playa', u'pleyaz']
None of these reduces all four tokens to a common stem, so what you probably need is a regex-based stemmer:
>>> from nltk import stem
>>> rxstem = stem.RegexpStemmer('er$|a$|as$|az$')
>>> [rxstem.stem(i) for i in tokens]
['play', 'play', 'play', 'pley']
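If you want to see what that suffix stripping amounts to, or you don't want to pull in NLTK just for this, here is a minimal sketch using only the standard `re` module. The names `SUFFIXES`, `strip_suffix` and `min_length` are made up for illustration and are not part of NLTK; the pattern is the same set of suffixes as above, and the length guard is an assumption about how you might want to protect very short words.

```python
import re

# Same suffixes as the RegexpStemmer pattern above, anchored at the end of the word.
# SUFFIXES and strip_suffix are illustrative names, not NLTK API.
SUFFIXES = re.compile(r'(?:er|as|az|a)$')

def strip_suffix(word, min_length=0):
    """Remove one trailing suffix; words shorter than min_length are returned unchanged."""
    if len(word) < min_length:
        return word
    return SUFFIXES.sub('', word)

tokens = ['player', 'playa', 'playas', 'pleyaz']
stems = [strip_suffix(t) for t in tokens]
print(stems)  # ['play', 'play', 'play', 'pley']
```

Note that 'pleyaz' still ends up as 'pley' rather than 'play': a suffix regex only trims endings, so spelling variants inside the word would need a separate normalisation step.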