nltk.tokenize.word_tokenize(text) is just a thin wrapper function that calls the tokenize method of a TreebankWordTokenizer instance, which uses regular expressions to tokenize the sentence.
The documentation for this class states that:

This tokenizer assumes that the text has already been segmented into sentences. Any periods, except those at the end of the string, are assumed to be part of the word they are attached to (e.g., for abbreviations, etc.), and are not separately tokenized.
The tokenize method itself is very simple:
def tokenize(self, text):
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)
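The end-of-string period rule described below can be sketched with a single regular expression. This is a simplified illustration, not NLTK's actual implementation, and `split_final_period` is a hypothetical name:

```python
import re

# Simplified sketch of the end-of-string period rule (NOT the real
# NLTK code): a period is split off into its own token only when it
# sits right before a newline or the end of the string.
def split_final_period(text):
    text = re.sub(r'\. *(\n|$)', r' . \1', text)
    return text.split()

print(split_final_period("Hello, world."))  # ['Hello,', 'world', '.']
print(split_final_period("etc. and more"))  # ['etc.', 'and', 'more']
```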
Basically, what this method does is tokenize the period as a separate token when it falls at the end of the string:
>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']
Any periods that fall inside the string are kept as part of the word they are attached to, on the assumption that they mark an abbreviation:
>>> nltk.tokenize.word_tokenize("Hello, world. How are you?")
['Hello', ',', 'world.', 'How', 'are', 'you', '?']
As long as this behavior is acceptable to you, you should be fine.
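If it is not, the workaround the docstring implies is to segment the text into sentences first (e.g., with nltk.tokenize.sent_tokenize) and then tokenize each sentence. A rough stdlib-only sketch of that pipeline, with a hypothetical `naive_tokenize` whose regex segmentation is far cruder than NLTK's real sentence tokenizer:

```python
import re

# Hedged sketch of the segment-then-tokenize pipeline: split into
# sentences naively on sentence-final punctuation, then tokenize each
# sentence so its final period becomes a separate token.
def naive_tokenize(text):
    tokens = []
    for sent in re.split(r'(?<=[.?!])\s+', text):
        # Split commas, question marks, and exclamation marks off words.
        sent = re.sub(r'([,?!])', r' \1 ', sent)
        # Split off only the sentence-final period, keeping mid-word ones.
        sent = re.sub(r'\.\s*$', ' . ', sent)
        tokens.extend(sent.split())
    return tokens

print(naive_tokenize("Hello, world. How are you?"))
# ['Hello', ',', 'world', '.', 'How', 'are', 'you', '?']
```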