I use nltk to split sentences into words, e.g.
nltk.word_tokenize("The code didn't work!") -> ['The', 'code', 'did', "n't", 'work', '!']
Tokenization works well at finding word boundaries (i.e. separating punctuation from words), but it sometimes over-splits: modifiers at the end of a word are treated as separate tokens. For example, didn't is split into did and n't, and I've is split into I and 've. Presumably this happens because such words are split in two in the underlying corpus that nltk's tokenizer is based on, and it may even be desirable in some cases.
Is there a built-in way to undo this behaviour? Perhaps something similar to how nltk's MWETokenizer can merge several words into phrases, but here merging the split components back into whole words (see the sketch below).
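
Something along these lines is what I have in mind; it is only a rough sketch that post-processes word_tokenize's output with MWETokenizer and an empty separator, and the tiny merge list is purely illustrative, not a complete inventory of contractions:

    import nltk
    from nltk.tokenize import MWETokenizer

    # Illustrative merge list of split contraction pieces (not exhaustive).
    merger = MWETokenizer([('did', "n't"), ('I', "'ve")], separator='')

    tokens = nltk.word_tokenize("The code didn't work!")
    # ['The', 'code', 'did', "n't", 'work', '!']

    print(merger.tokenize(tokens))
    # ['The', 'code', "didn't", 'work', '!']
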
Alternatively, is there another tokenizer that does not split words into such parts in the first place?
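
For instance, a regex-based tokenizer that keeps word-internal apostrophes together would do the kind of thing I mean; the pattern below is just a quick sketch (nltk's TweetTokenizer also seems to leave contractions intact):

    from nltk.tokenize import RegexpTokenizer

    # Match words with an internal apostrophe first, then plain words,
    # then any single non-space punctuation character.
    tokenizer = RegexpTokenizer(r"\w+'\w+|\w+|[^\w\s]")

    print(tokenizer.tokenize("The code didn't work!"))
    # ['The', 'code', "didn't", 'work', '!']
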