Paragraph segmentation using machine learning

I have a large repository of PDF documents. Documents come from different sources and do not have a single style. I use Tika to extract text from documents, and now I would like to segment the text into paragraphs.

I cannot use regular expressions because the documents do not have a single style:

  • The number of newlines (\n) between paragraphs varies from 2 to 4.
  • In some documents, lines within a single paragraph are separated by two newlines; in others, by one.
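To make the problem concrete, here is a minimal sketch with two made-up samples (`doc_a` and `doc_b` are hypothetical, not taken from the real repository). Both contain the same two paragraphs, but a single "split on blank lines" rule only works for one of them:

```python
import re

# Hypothetical samples: the same two paragraphs in two layout styles.
# Style A: one newline inside a paragraph, two between paragraphs.
doc_a = "First paragraph, line one.\nFirst paragraph, line two.\n\nSecond paragraph."
# Style B: two newlines inside a paragraph, four between paragraphs.
doc_b = "First paragraph, line one.\n\nFirst paragraph, line two.\n\n\n\nSecond paragraph."

def split_on_blank_lines(text):
    """Split wherever two or more consecutive newlines occur."""
    return [part for part in re.split(r"\n{2,}", text) if part.strip()]

print(len(split_on_blank_lines(doc_a)))  # 2 (correct)
print(len(split_on_blank_lines(doc_b)))  # 3 (the first paragraph is cut in half)
```

Raising the threshold to three newlines would fix style B but merge everything in style A, so no single fixed pattern covers both.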

So I am turning to machine learning. The (great) NLTK book has a nice example of using classification for sentence segmentation, with features such as the characters before and after a ".", and a Bayesian classifier, but it has nothing on paragraph segmentation.
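The same idea might carry over to paragraphs by treating every run of newlines as a candidate boundary and describing it with a few features. The sketch below is only an illustration under that assumption: the function name `boundary_features` and all feature names are hypothetical, and which features actually help would have to be checked on real documents.

```python
import re

def boundary_features(text):
    """Yield (offset, features) for every run of newlines in `text`.

    Each run is a candidate paragraph boundary. The feature names are
    illustrative assumptions, not taken from the NLTK book.
    """
    for match in re.finditer(r"\n+", text):
        before = text[:match.start()].rsplit("\n", 1)[-1]  # line before the run
        after = text[match.end():].split("\n", 1)[0]       # line after the run
        yield match.start(), {
            "newline_count": len(match.group()),
            "prev_ends_sentence": before.rstrip().endswith((".", "!", "?")),
            "next_starts_upper": after.lstrip()[:1].isupper(),
            "prev_line_length": len(before),
        }

# Hypothetical sample: a wrapped line followed by a real paragraph break.
sample = "Line one of a paragraph\ncontinues here.\n\nNext paragraph starts."
for offset, features in boundary_features(sample):
    print(offset, features)
```

Each feature dictionary, paired with a hand-assigned label such as "break" or "no break", has the shape that NLTK classifiers like `nltk.NaiveBayesClassifier.train` take as training input, so labeled data would still be needed.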

So my questions are:

  • Is there another way to do paragraph segmentation?
  • If I go with machine learning, is there tagged data of segmented paragraphs that I can use for training?

Source: https://habr.com/ru/post/1667528/

