Problem
Unfortunately, you cannot make the tokenizer keep the trailing newlines, not the way it is written.
Starting from the NLTK source and following the function calls through span_tokenize() and _slices_from_text(), you can see that there is a condition
if match.group('next_tok'):
which is designed to make the tokenizer skip whitespace until the token that starts the next possible sentence. Hunting for the regular expression this refers to brings us to _period_context_fmt, where we see that the named group next_tok is preceded by \s+, so the whitespace is never captured.
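For context, the loop around that condition looks roughly like this (a paraphrase of the NLTK source; exact details vary between versions). When next_tok matches, the next slice starts at the next token, so the whitespace between sentences ends up in neither slice:

def _slices_from_text(self, text):
    last_break = 0
    for match in self._lang_vars.period_context_re().finditer(text):
        context = match.group() + match.group('after_tok')
        if self.text_contains_sentbreak(context):
            # this slice ends at the sentence-final punctuation
            yield slice(last_break, match.end())
            if match.group('next_tok'):
                # next sentence starts at the next token; the whitespace
                # in between is skipped over entirely
                last_break = match.start('next_tok')
            else:
                # next sentence starts at the following punctuation
                last_break = match.end()
    yield slice(last_break, len(text))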
Solution
Break it down, change the part you don't like, and reassemble your own solution.
This regular expression lives in the PunktLanguageVars class, which is used to initialize the PunktSentenceTokenizer. We just need to derive a custom class from PunktLanguageVars and fix the regular expression the way we want.
The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this:
_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""
to this:
_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    \s*                          # <-- THIS is what I changed
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        (?P<next_tok>\S+)        # <-- Normally you would have \s+ here
    ))"""
Now a tokenizer that uses this regular expression instead of the old one will include zero or more \s characters after the end of each sentence.
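To see why moving the whitespace out of the lookahead makes the difference, here is a minimal standalone sketch with simplified stand-in patterns (not the actual Punkt template, which relies on the %(SentEndChars)s and %(NonWord)s substitutions):

import re

# Simplified stand-ins for the old and modified Punkt patterns.
old = re.compile(r"\S*[.!?](?=(?:[^\w\s]|\s+\S+))")
new = re.compile(r"\S*[.!?]\s*(?=(?:[^\w\s]|\S+))")

text = "Loud beep.\n\n Next sentence."
print(repr(old.search(text).group()))  # 'beep.'       -- whitespace left behind
print(repr(new.search(text).group()))  # 'beep.\n\n '  -- whitespace consumed

In the old pattern the whitespace sits inside the (?=...) lookahead, which only peeks ahead without consuming, so the match stops at the punctuation. With \s* in the consumed part, the trailing whitespace becomes part of the match itself.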
Whole script
import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                          # <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)        # <-- Normally you would have \s+ here
        ))"""

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))
This outputs:
['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
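Since the whitespace is now part of the sentences rather than discarded, concatenating the pieces reconstructs the input exactly, which makes a handy sanity check:

assert ''.join(custom_tknzr.tokenize(s)) == s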