Split multi-paragraph documents into sentences with paragraph numbers

I have a list of well-parsed documents, each with several paragraphs (paragraphs are separated by \n\n and sentences by "."), which I would like to split into sentences, each prefixed with the number of the paragraph it belongs to within the document. For example, given this input (two paragraphs):

First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n 

Ideally, the output should be:

 1 First sentence of the 1st paragraph.
 1 Second sentence of the 1st paragraph.
 2 First sentence of the 2nd paragraph.
 2 Second sentence of the 2nd paragraph.

I am familiar with the Lingua::Sentence package in Perl, which can split documents into sentences. However, it does not number paragraphs. So I wonder whether there is another way to achieve the above (there are no abbreviations in the documents). Any help is appreciated. Thanks!

+4
2 answers

Since you mention Lingua::Sentence, one option is to lightly post-process that module's output to get what you need:

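 use Lingua::Sentence;

 # $text is assumed to hold the whole document.
 my $splitter = Lingua::Sentence->new("en");    # assumes English text

 # split() returns the text with one sentence per line; cut it into paragraphs.
 my @paragraphs = split /\n{2,}/, $splitter->split($text);

 foreach my $index (0..$#paragraphs) {
     # Prefix every sentence with its 1-based paragraph number.
     my $paragraph = join "\n\n", map { $index+1 . " $_" } split /\n/, $paragraphs[$index];
     print "$paragraph\n\n";
 }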
+2

If you can rely on a period (.) being the sentence separator, you can do this:

 perl -00 -nlwe 'print qq($. $_) for split /(?<=\.)/' yourfile.txt 

Explanation:

  • -00 sets the input record separator to the empty string, which enables paragraph mode.
  • -l removes the record separator from each input record and sets the output record separator to the input record separator, which in this case translates to two newlines.

Then we simply split after each period using a lookbehind assertion and print every sentence preceded by the record number $., which in paragraph mode is the paragraph number.
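
For readers who prefer a script to a one-liner, the same idea can be sketched as a small standalone program. This is only an illustration under the question's assumptions (blank lines separate paragraphs, every sentence ends in a period, no abbreviations). The helper name number_sentences is hypothetical and not part of either answer. Unlike the one-liner, it also trims the whitespace around each sentence.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answers above): returns a list of
# "N sentence" strings, where N is the 1-based paragraph number.
sub number_sentences {
    my ($text) = @_;
    my @out;
    my $n = 0;

    # Paragraphs are separated by at least one blank line.
    for my $para (split /\n\s*\n/, $text) {
        $para =~ s/^\s+|\s+$//g;       # trim surrounding whitespace
        next unless length $para;      # skip chunks that were only whitespace
        $n++;

        # Split after each period; the lookbehind keeps the period attached
        # to its sentence, and \s* discards the whitespace that follows it.
        for my $sentence (split /(?<=\.)\s*/, $para) {
            push @out, "$n $sentence" if length $sentence;
        }
    }
    return @out;
}

# Sample input taken from the question.
my $text = "First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n"
         . " First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n ";

print "$_\n" for number_sentences($text);
```

With the sample input from the question, this prints each of the four sentences on its own line, prefixed with 1 or 2. Keeping the paragraph counter in a plain variable instead of $. makes the function usable on strings that did not come from a file.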

+5

Source: https://habr.com/ru/post/1496432/

