Preventing apostrophe splitting when tokenizing words with nltk

I use nltk to split sentences into words, e.g.:

  >>> import nltk
  >>> nltk.word_tokenize("The code didn't work!")
  ['The', 'code', 'did', "n't", 'work', '!']

The tokenization works well at splitting up word boundaries [i.e. separating punctuation from words], but sometimes it over-splits, and modifiers at the end of a word get treated as separate parts. For example, didn't is split into the parts did and n't, and i've is split into I and 've. Obviously this is because such words are split in two in the original corpus that nltk uses, and that may be desirable in some cases.

Is there any built-in way to override this behavior? Possibly in a manner similar to how nltk's MWETokenizer is able to aggregate multiple words into phrases, but in this case to aggregate word components back into words.

Alternatively, is there another tokenizer that will not split up word parts?
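
(For concreteness, here is a minimal sketch of the kind of aggregation I mean, using MWETokenizer as a post-processing step over word_tokenize's output. The list of contraction pieces is made up for the example and would need to be extended to cover real text.)

  >>> from nltk.tokenize import MWETokenizer, word_tokenize
  >>> merger = MWETokenizer([('did', "n't"), ('I', "'ve")], separator='')
  >>> merger.tokenize(word_tokenize("The code didn't work!"))
  ['The', 'code', "didn't", 'work', '!']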

1 answer

It actually works as expected:

This is the correct/expected output. For word tokenization, contractions are considered two words because, meaning-wise, they are.

Different nltk tokenizers handle English contractions differently. For example, I found that TweetTokenizer does not split the contraction into two parts:

 >>> from nltk.tokenize import TweetTokenizer
 >>> tknzr = TweetTokenizer()
 >>> tknzr.tokenize("The code didn't work!")
 [u'The', u'code', u"didn't", u'work', u'!']
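
(Not from the original answer, just an additional workaround sketch: if you want to keep word_tokenize and re-join the contraction pieces afterwards, a simple post-processing pass could look like the following. The merge_contractions helper is hypothetical, not part of nltk; it assumes the suffix forms word_tokenize emits, such as "n't" and "'ve", and would naively also merge possessive "'s" or a stray closing quote.)

 from nltk import word_tokenize

 def merge_contractions(tokens):
     # Re-attach contraction suffixes ("n't", "'ve", "'re", ...) to the
     # preceding token. Hypothetical helper, not part of nltk itself;
     # note it also merges possessive "'s" and stray closing quotes.
     merged = []
     for tok in tokens:
         if merged and (tok == "n't" or tok.startswith("'")):
             merged[-1] += tok
         else:
             merged.append(tok)
     return merged

 print(merge_contractions(word_tokenize("The code didn't work!")))
 # ['The', 'code', "didn't", 'work', '!']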

Please see additional information and workarounds at:


Source: https://habr.com/ru/post/1240310/

