Paragraph segmentation using machine learning

Question

I have a large repository of PDF documents. Documents come from different sources and do not have a single style. I use Tika to extract text from documents, and now I would like to segment the text into paragraphs.

I cannot use regular expressions because the documents do not have a single style:

The number of newlines (\n) between paragraphs varies from 2 to 4.
In some documents, lines within a single paragraph are separated by two newlines; in others, by one.

So I am turning to machine learning. The NLTK book has an excellent example that treats sentence segmentation as a classification task, using features such as the characters before and after a candidate boundary with a naive Bayes classifier, but it does not cover paragraph segmentation.
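To make the idea concrete, here is a minimal sketch of how I imagine the same feature-based approach could carry over to paragraph boundaries: every run of newlines becomes a candidate boundary described by a small feature dict. The function names and the features themselves are my own assumptions for illustration, not something taken from the NLTK book or from Tika.

```python
import re

# A candidate boundary is any run of one or more newlines.
NEWLINE_RUN = re.compile(r"\n+")


def boundary_features(text, match):
    """Describe one run of newlines as a feature dict (illustrative features only)."""
    prev_line = text[:match.start()].rsplit("\n", 1)[-1]
    next_line = text[match.end():].split("\n", 1)[0]
    return {
        # How many newlines make up this run (1, 2, 3, ...).
        "newline_count": len(match.group()),
        # Does the line before the run end like a finished sentence?
        "prev_ends_sentence": prev_line.rstrip().endswith((".", "!", "?", ":")),
        # Does the line after the run start with an uppercase letter?
        "next_starts_upper": next_line.lstrip()[:1].isupper(),
        # Is the line after the run indented?
        "next_is_indented": next_line[:1] in (" ", "\t"),
        # Short lines before a break often end a paragraph.
        "prev_line_length": len(prev_line),
    }


def candidate_boundaries(text):
    """Return (offset, features) for each newline run that has text on both sides."""
    candidates = []
    for match in NEWLINE_RUN.finditer(text):
        if match.start() == 0 or match.end() == len(text):
            continue  # leading or trailing newlines cannot separate paragraphs
        candidates.append((match.start(), boundary_features(text, match)))
    return candidates


sample = "First line of a paragraph\nthat wraps here.\n\nSecond paragraph starts.\n"
for offset, features in candidate_boundaries(sample):
    print(offset, features)
```

Each feature dict, paired with a hand-assigned label (paragraph break or not), could then be passed to a classifier; for example, `nltk.NaiveBayesClassifier.train` accepts a list of `(featureset, label)` pairs. Whether these particular features are informative enough for my documents is exactly what I do not know.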

So my questions are: