I have a large repository of PDF documents. Documents come from different sources and do not have a single style. I use Tika to extract text from documents, and now I would like to segment the text into paragraphs.
I cannot use regular expressions because the documents do not have a single style:
- The number of \n between paragraphs varies from 2 to 4.
- In some documents, the lines within a single paragraph are separated by two \n, and in others by a single \n.
So I am turning to machine learning. The (large) Python NLTK book has an excellent example of using classification for sentence segmentation, with features such as the characters before and after a candidate boundary and a naive Bayes classifier, but it has nothing on paragraph segmentation.
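To make the idea concrete, here is a rough sketch of how I imagine adapting that feature-based approach to paragraphs: treat every run of newlines as a candidate break and describe it with a few features. The function names and the feature set are my own guesses for illustration, not something taken from the NLTK book or from Tika.

```python
import re

def newline_run_features(text, start, end):
    """Describe one run of newlines, text[start:end], as a feature dict.

    The feature set is a hypothetical starting point, not a tested recipe.
    """
    before = text[:start].rstrip(" \t")
    after = text[end:].lstrip(" \t")
    prev_line = before.rsplit("\n", 1)[-1]
    next_line = after.split("\n", 1)[0]
    return {
        # How many newlines make up this gap (1, 2, 3, 4, ...).
        "newline_count": text[start:end].count("\n"),
        # Does the text before the gap end like a finished sentence?
        "prev_ends_sentence": before.endswith((".", "?", "!", ":")),
        # Does the text after the gap start with an uppercase letter?
        "next_starts_upper": after[:1].isupper(),
        # Short lines before a gap often mark the end of a paragraph.
        "prev_line_length": len(prev_line),
        "next_line_length": len(next_line),
    }

def candidate_breaks(text):
    """Yield (start, end, features) for every run of newlines in text."""
    # A run is one or more newlines, allowing whitespace-only lines between.
    for match in re.finditer(r"(?:[ \t]*\n)+", text):
        start, end = match.start(), match.end()
        yield start, end, newline_run_features(text, start, end)

if __name__ == "__main__":
    sample = "First line\nsecond line.\n\nNew para."
    for start, end, features in candidate_breaks(sample):
        print(start, end, features)
```

If each feature dict were paired with a hand-assigned label (paragraph break or not), the resulting list of (features, label) pairs is the input format that nltk.NaiveBayesClassifier.train expects. That still requires labeled data, which is what my second question is about.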
So my questions are:
- Is there any other way to do paragraph segmentation?
- If I go with machine learning, is there labeled data with segmented paragraphs that I can use for training?