Keep blank lines with the NLTK Punkt tokenizer

I am using the NLTK Punkt sentence tokenizer to split a file into a list of sentences, and would like to preserve the blank lines in the file:

from nltk import data

tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print sentences

I would like this to print:

 ['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n'] 

But the content that is actually printed shows that the trailing blank lines were removed from the first and third sentences:

 ['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n'] 

Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option in the case of the Punkt tokenizer. It's very possible that I'm missing something simple. Is there a way to retain these trailing blank lines using the Punkt sentence tokenizer? I would be grateful for any insight others can offer!
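(For reference, a minimal sketch of the blanklines='keep' behaviour I mean, on another NLTK tokenizer, LineTokenizer. It splits on lines rather than sentences, so it is not itself a fix for my problem, only an illustration of the parameter:)

from nltk.tokenize import LineTokenizer

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
# 'keep' tells LineTokenizer to return blank lines as tokens instead of dropping them
print(LineTokenizer(blanklines='keep').tokenize(s))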

+5
4 answers

Problem

Unfortunately, there is no way to make the tokenizer keep the blank lines, at least not the way it is written.

Starting here and following the function calls through span_tokenize() and _slices_from_text(), you can see that there is a condition

if match.group('next_tok'):

which is designed to make the tokenizer skip whitespace until the next possible sentence-starting token. Looking for the regular expression this condition refers to, we end up at _period_context_fmt, where we see that the named group next_tok is preceded by \s+, so the blank lines will never be captured.
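(To see this skipping in action, here is a small sketch of my own, not from the original answer: print the spans that span_tokenize() reports for the question's string. The gaps between one span's end and the next span's start are exactly the whitespace that never makes it into any token.)

from nltk import data

tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

for start, end in tokenizer.span_tokenize(s):
    # whitespace between this span's end and the next span's start is skipped
    print((start, end), repr(s[start:end]))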

Solution

Break it apart, change the part you don't like, and reassemble your own custom solution.

This regular expression lives in the PunktLanguageVars class, which is used to initialize the PunktSentenceTokenizer. We just need to derive a custom class from PunktLanguageVars and fix the regular expression the way we want it.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""

to this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    \s*                          # <-- THIS is what I changed
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        (?P<next_tok>\S+)        # <-- Normally you would have \s+ here
    ))"""

Now a tokenizer built with this regular expression instead of the one above will include 0 or more \s characters after the end of a sentence.

Whole script

import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                          # <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)        # <-- Normally you would have \s+ here
        ))"""

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))

This outputs:

 ['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n'] 
+10

Split the input into paragraphs, splitting on a capturing regexp (so the captured separator strings are returned as well):

import re
import nltk

paras = re.split("(\n\s*\n)", s)   # s is the raw input string from the question

Then you can apply nltk.sent_tokenize() to the individual paragraphs, and either process the results per paragraph or flatten the list, whichever suits your further use best.

sents_by_para = [nltk.sent_tokenize(p) for p in paras]
flat = [sent for par in sents_by_para for sent in par]

(It seems that sent_tokenize() does not mangle whitespace-only strings, so there is no need to check for them and exclude them from processing.)

If you specifically want the whitespace attached to the previous sentence, you can easily stick it back on:

collapsed = []
for s in flat:
    if s.isspace() and len(collapsed) > 0:
        collapsed[-1] += s
    else:
        collapsed.append(s)
+1

I would go with itertools.groupby, see Python: how to itertools.groupby over blocks of strings:

alvas@ubi:~$ echo """This is a foo bar sentence,
that is also a foo bar sentence.

But I don't like foobars.
Yes you do like bars with foos, no?

I'm not sure whether you like bar bar!
Neither do I like black sheep.""" > test.in
alvas@ubi:~$ python
>>> from nltk import sent_tokenize
>>> import itertools
>>> with open('test.in', 'r') as fin:
...     for key, group in itertools.groupby(fin, lambda x: x!='\n'):
...         if key:
...             print list(group)
...
['This is a foo bar sentence,\n', 'that is also a foo bar sentence.\n']
["But I don't like foobars.\n", 'Yes you do like bars with foos, no?\n']
["I'm not sure whether you like bar bar!\n", 'Neither do I like black sheep.\n']

And after that, if you want to run sent_tokenize or other punkt models within each group:

>>> with open('test.in', 'r') as fin:
...     for key, group in itertools.groupby(fin, lambda x: x!='\n'):
...         if key:
...             paragraph = " ".join(line.strip() for line in group)
...             print sent_tokenize(paragraph)
...
['This is a foo bar sentence, that is also a foo bar sentence.']
["But I don't like foobars.", 'Yes you do like bars with foos, no?']
["I'm not sure whether you like bar bar!", 'Neither do I like black sheep.']

(Note: using mmap would be more computationally efficient, see fooobar.com/questions/874668/... But for the sizes I work with (~20 million tokens), itertools.groupby was sufficient.)
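(A rough sketch of what that mmap variant could look like, assuming the same test.in file; this is my reading of the linked suggestion, not code from the original answer:)

import mmap
import itertools
from nltk import sent_tokenize

with open('test.in', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)   # map the file read-only
    lines = iter(mm.readline, b"")                           # lazy line iterator over the mapping
    for key, group in itertools.groupby(lines, lambda x: x.strip() != b""):
        if key:
            paragraph = " ".join(line.decode("utf-8").strip() for line in group)
            print(sent_tokenize(paragraph))
    mm.close()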

0

In the end, I combined ideas from both @alexis and @HugoMailhot so that line breaks are kept even when a single paragraph contains multiple sentences and/or line breaks:

import re, nltk, sys, codecs
import nltk.tokenize.punkt as pkt
from nltk import data

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                          # <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)        # <-- Normally you would have \s+ here
        ))"""

custom_tokenizer = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

def sentence_split(s):
    '''Read in a string and return a list of sentences with linebreaks intact'''
    paras = re.split("(\n\s*\n)", s)
    sents_by_para = [custom_tokenizer.tokenize(p) for p in paras]
    flat = [sent for par in sents_by_para for sent in par]
    collapsed = []
    for s in flat:
        if s.isspace() and len(collapsed) > 0:
            collapsed[-1] += s
        else:
            collapsed.append(s)
    return collapsed

if __name__ == "__main__":
    s = codecs.open(sys.argv[1], 'r', 'utf-8').read()
    sentences = sentence_split(s)
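A quick way to try the function without a file argument (my own usage sketch, reusing the sample string from the question):

sample = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
for sent in sentence_split(sample):
    print(repr(sent))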
0
