Split text into sentences

I want to split the text into sentences. Can anybody help me?

I also need to handle abbreviations. However, my plan is to replace them at an earlier stage. Mr. → Mr.

    import re
    import unittest


    class Sentences:
        def __init__(self, text):
            # Split wherever '.', '!' or '?' is followed by whitespace.
            self.sentences = tuple(re.split(r"[.!?]\s", text))


    class TestSentences(unittest.TestCase):
        def testFullStop(self):
            self.assertEqual(Sentences("X. X.").sentences, ("X.", "X."))

        def testQuestion(self):
            self.assertEqual(Sentences("X? X?").sentences, ("X?", "X?"))

        def testExclaimation(self):
            self.assertEqual(Sentences("X! X!").sentences, ("X!", "X!"))

        def testMixed(self):
            self.assertEqual(Sentences("X! X? X! X.").sentences, ("X!", "X?", "X!", "X."))


    if __name__ == "__main__":
        unittest.main()

Thanks, Barry

EDIT: For starters, I would be happy to satisfy the four tests that I included above. This will help me better understand how regular expressions work. At the moment, I can define the sentence as X. etc., as defined in my tests.

1 answer

Segmenting text into sentences can be a very difficult task, especially when the text contains abbreviations that end in a period. Recognizing them may require a list of known abbreviations or a trained classifier.
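To make the abbreviation-list idea concrete, here is a minimal sketch using only the standard library. The name `split_sentences` and the tiny `ABBREVIATIONS` set are illustrative assumptions, not part of any library. The lookbehind keeps the terminator attached to each sentence, which is what the four tests in the question expect.

```python
import re

# Hypothetical, deliberately tiny list; a real one would be much longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "etc."}

# Split on whitespace that follows '.', '!' or '?'. The lookbehind matches the
# terminator without consuming it, so it stays on the sentence.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences, re-joining pieces that end in a known abbreviation."""
    sentences = []
    pending = ""
    for piece in BOUNDARY.split(text.strip()):
        if not piece:
            continue
        # Re-joining uses a single space, so original whitespace is normalized.
        candidate = f"{pending} {piece}" if pending else piece
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            # Probably not a real boundary: hold the text and keep reading.
            pending = candidate
        else:
            sentences.append(candidate)
            pending = ""
    if pending:
        sentences.append(pending)
    return sentences


print(split_sentences("X! X? X! X."))
print(split_sentences("Mr. Smith arrived. He sat down."))
```

A sentence that genuinely ends in a listed abbreviation (for example "... apples, pears, etc.") is merged with the one that follows. A fixed list cannot resolve that ambiguity, which is why a trained tokenizer is often preferred.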

I suggest you use NLTK, a set of open source Python modules designed to handle natural language.

You can read about sentence segmentation using NLTK here, and decide for yourself if this tool is right for you.

EDITED: or even simpler here, and here is the source code. This is the Punkt sentence tokenizer included in NLTK.
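As a rough illustration, the Punkt tokenizer can be tried like this, assuming NLTK is installed (`pip install nltk`). The sample text is made up. The sketch builds an untrained `PunktSentenceTokenizer`, which needs no downloaded model data but also knows no abbreviations. The pretrained English model behind `nltk.sent_tokenize` requires fetching the Punkt data with `nltk.download` first.

```python
# Requires the third-party package: pip install nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Made-up sample text for illustration.
text = "The parser failed. Did you check the log? Restart it!"

# An untrained tokenizer relies only on Punkt's built-in heuristics.
# It has not learned any abbreviations, so it is a baseline, not the best case.
tokenizer = PunktSentenceTokenizer()
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```

Punkt can also be trained on a sample of your own text by passing it to the constructor, which lets it learn the abbreviations that occur in your domain.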


Source: https://habr.com/ru/post/1368935/

