Neither the stemmer nor the lemmatizer can get you from greatest -> great:
    >>> from nltk.stem import WordNetLemmatizer, PorterStemmer
    >>> porter = PorterStemmer()
    >>> wnl = WordNetLemmatizer()
    >>> greatest = 'greatest'
    >>> porter.stem(greatest)
    u'greatest'
    >>> wnl.lemmatize(greatest)
    'greatest'
    >>> greater = 'greater'
    >>> wnl.lemmatize(greater)
    'greater'
    >>> porter.stem(greater)
    u'greater'
But it looks like you can use some nice properties of the Penn Treebank tags to get from greatest -> great:
    >>> from nltk import pos_tag
    >>> pos_tag(['greatest'])
    [('greatest', 'JJS')]
    >>> pos_tag(['greater'])
    [('greater', 'JJR')]
    >>> pos_tag(['great'])
    [('great', 'JJ')]
Let's try a crazy rule-based system, starting with greatest:
    >>> import re
    >>> word1 = 'greatest'
    >>> re.sub('est$', '', word1)
    'great'
    >>> re.sub('est$', 'er', word1)
    'greater'
    >>> pos_tag([re.sub('est$', '', word1)])[0][1]
    'JJ'
    >>> pos_tag([re.sub('est$', 'er', word1)])[0][1]
    'JJR'
    >>> word1
    'greatest'
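One caveat on the substitution above: `re.sub('est$', ...)` is purely textual, so it fires on any word that happens to end in "est", whether or not it is a superlative. A quick sketch with a few words of my own choosing (the helper name `strip_est` is just for illustration) shows where it goes wrong:

```python
import re

def strip_est(word):
    """Drop a trailing 'est', the same way re.sub('est$', '', word) does."""
    return re.sub(r'est$', '', word)

# Illustrative inputs picked for this sketch, not taken from the runs above.
for word in ['greatest', 'biggest', 'nicest', 'forest']:
    print(word, '->', strip_est(word))
# greatest -> great
# biggest -> bigg
# nicest -> nic
# forest -> for
```

So the regex alone over-generates, which is why it helps to validate its output against the POS tags.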
Now that we know we can build our own stemmer / lemmatizer / tail_substituter, let's write a rule: if a word gets a superlative POS tag (JJS), and our tail_substituter gives us JJ when we strip the suffix and JJR when we swap it for "er", then we can say with some confidence that the comparative and base forms of the word can be obtained with our tail_substituter:
    >>> if pos_tag([word1])[0][1] == 'JJS' \
    ... and pos_tag([re.sub('est$', '', word1)])[0][1] == 'JJ' \
    ... and pos_tag([re.sub('est$', 'er', word1)])[0][1] == 'JJR':
    ...     comparative = re.sub('est$', 'er', word1)
    ...     adjective = re.sub('est$', '', word1)
    ...
    >>> adjective
    'great'
    >>> comparative
    'greater'
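The same check could be wrapped in a reusable function, roughly as sketched here. The names `degrees` and `nltk_tag` are hypothetical, not part of NLTK, and the tagger is passed in as a plain callable so the rule can be exercised without the tagger model installed:

```python
import re

def degrees(word, tag):
    """Return (adjective, comparative) for a superlative, or None.

    `tag` is any callable mapping a single word to its Penn Treebank tag.
    """
    # Only words that end in 'est' and are tagged as superlatives qualify.
    if not word.endswith('est') or tag(word) != 'JJS':
        return None
    adjective = re.sub(r'est$', '', word)
    comparative = re.sub(r'est$', 'er', word)
    # Accept the candidates only if the tagger agrees with both guesses.
    if tag(adjective) == 'JJ' and tag(comparative) == 'JJR':
        return adjective, comparative
    return None

def nltk_tag(word):
    """Tag one word with NLTK's pos_tag (needs NLTK and its tagger data)."""
    from nltk import pos_tag  # imported lazily so degrees() works without NLTK
    return pos_tag([word])[0][1]
```

With NLTK installed, and assuming the tagger returns the same tags as in the session above, `degrees('greatest', nltk_tag)` should give `('great', 'greater')`. Words the tagger does not confirm come back as `None`.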
Now you can get from greatest -> greater -> great. Going from great -> best is a bit odd because the two words are not lexically related, although they seem related in meaning. So I think it would be subjective to say that great -> best is a valid conversion.