Split text into sentences

Question

I want to split the text into sentences. Can anybody help me?

I also need to handle abbreviations. However, my plan is to replace them at an earlier stage. Mr. → Mr.

import re
import unittest


class Sentences:
    def __init__(self, text):
        # Current attempt: split on sentence punctuation followed by whitespace
        self.sentences = tuple(re.split(r"[.!?]\s", text))


class TestSentences(unittest.TestCase):
    def testFullStop(self):
        self.assertEqual(Sentences("X. X.").sentences, ("X.", "X."))

    def testQuestion(self):
        self.assertEqual(Sentences("X? X?").sentences, ("X?", "X?"))

    def testExclaimation(self):
        self.assertEqual(Sentences("X! X!").sentences, ("X!", "X!"))

    def testMixed(self):
        self.assertEqual(Sentences("X! X? X! X.").sentences, ("X!", "X?", "X!", "X."))

Thanks, Barry

EDIT: For starters, I would be happy just to satisfy the four tests I included above, since that will help me better understand how regular expressions work. For now, a sentence can be defined as X., X? or X!, as in my tests.

+4

python python-3.x regex text-segmentation

Baz Aug 25 '11 at 10:17

1 answer

Ido.Co · Accepted Answer · 2011-08-25T10:32:33+0000

Sentence segmentation can be a very difficult task, especially when the text contains abbreviations that end in a period. Recognizing them may require a list of known abbreviations or a trained classifier.
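To illustrate the abbreviation-list idea, here is a minimal sketch using only the standard `re` module. It is not a robust segmenter: the function name `split_sentences` and the tiny `KNOWN_ABBREVIATIONS` set are made up for this example, and a real list would be far longer. The lookbehind keeps the punctuation attached to each sentence, which is what the four tests in the question expect.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
KNOWN_ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "etc."}


def split_sentences(text, abbreviations=KNOWN_ABBREVIATIONS):
    """Split after '.', '!' or '?', but not after a known abbreviation."""
    # (?<=[.!?]) matches the position just after the punctuation without
    # consuming it, so every piece keeps its terminator.
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    carry = ""  # text held back because it ended in a known abbreviation
    for piece in pieces:
        if not piece:  # only happens for empty input
            continue
        candidate = f"{carry} {piece}" if carry else piece
        # The last whitespace-separated token decides whether this is a real boundary.
        if candidate.split()[-1] in abbreviations:
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return tuple(sentences)


print(split_sentences("X! X? X! X."))
print(split_sentences("Mr. Smith left. He was late!"))
```

A known weakness of this approach: a sentence that really does end in an abbreviation (for example "... apples, pears, etc.") gets merged with the next one. That ambiguity is the reason a trained model is often used instead.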

I suggest you use NLTK, a suite of open-source Python modules for natural language processing.

You can read about sentence segmentation using NLTK here, and decide for yourself whether this tool is right for you.

EDITED: or even simpler here, and here is the source code. This is the Punkt sentence tokenizer included in NLTK.
