Rails gem to break a paragraph into a series of sentences

I am trying to split a paragraph into groups of sentences so that each group stays under N characters. If a single sentence is longer than N, it should be broken into pieces, using punctuation marks or spaces as delimiters.

For example, if N = 50, then the following string

"Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. Fusce urna libero blandit eu aliquet ac rutrum vel tortor."

will become

["Lorem ipsum, consectetur elit. Donec ut ligula.", "Sed acumsan posuere tristique.", "Sed et tristique sem.", "Aenean sollicitudin", "sapien sodales elementum blandit.", "Fusce urna libero blandit eu aliquet ac rutrum vel", "tortor."]

Are there any Rails gems that could help me achieve this? I looked at html_slicer, but I'm not sure it can handle the example above.

1 answer

There are two non-trivial tasks to achieve what you need:

  • splitting a string into sentences, and
  • word-wrapping each sentence, taking special care with punctuation.

I think the first one is not easy to implement from scratch, so it's best to use a natural language processing library if your "third-party language processing service" doesn't offer such a function. I don't know of any Rails gem that satisfies your requirements.

Here is a toy example of splitting a string into sentences using stanford-core-nlp.

    require 'stanford-core-nlp'

    text = "Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. Fusce urna libero blandit eu aliquet ac rutrum vel tortor."

    # Load only the tokenizer and the sentence splitter.
    pipeline = StanfordCoreNLP.load(:tokenize, :ssplit)
    a = StanfordCoreNLP::Annotation.new(text)
    pipeline.annotate(a)

    # Map with to_s if you want an array of sentence strings.
    sentences = a.get(:sentences).to_a.map(&:to_s)
    # => ["Lorem ipsum, consectetur elit.", "Donec ut ligula.", "Sed acumsan posuere tristique.", "Sed et tristique sem.", "Aenean sollicitudin, sapien sodales elementum blandit.", "Fusce urna libero blandit eu aliquet ac rutrum vel tortor."]
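If pulling in an NLP library is not an option, a regex-based splitter may be enough for simple text. The following is only a rough sketch: `naive_sentences` is a hypothetical helper name, and it assumes that every '.', '!' or '?' followed by whitespace ends a sentence, which breaks on abbreviations such as "Dr." or "e.g.".

```ruby
# Naive sentence splitter (hypothetical helper, not part of any gem).
# Assumption: '.', '!' or '?' followed by whitespace always ends a sentence.
def naive_sentences(text)
  # The lookbehind keeps the punctuation attached to the sentence it ends.
  text.strip.split(/(?<=[.!?])\s+/)
end

p naive_sentences("Lorem ipsum, consectetur elit. Donec ut ligula. Sed et tristique sem.")
# => ["Lorem ipsum, consectetur elit.", "Donec ut ligula.", "Sed et tristique sem."]
```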

The second problem resembles word wrap. If it were purely a word-wrap problem, it would be easy to solve with an existing implementation such as ActionView::Helpers::TextHelper#word_wrap. However, there is the additional requirement about punctuation, and I don't know of any existing implementation that does exactly that. You may need to come up with your own solution.
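For the plain word-wrap part, here is a minimal sketch in pure Ruby so it runs outside Rails. `wrap_words` is a hypothetical helper name; it uses a regex to break at whitespace only. Inside a Rails view, `word_wrap(text, line_width: 50)` covers the same ground.

```ruby
# Plain-Ruby word wrap sketch; `wrap_words` is a hypothetical helper name.
# It breaks only at whitespace, so a single word longer than `width`
# is left intact rather than cut in the middle.
def wrap_words(text, width)
  # Greedily take up to `width` characters that end at whitespace or at
  # the end of the string, and terminate each such run with a newline.
  text.gsub(/(.{1,#{width}})(\s+|\z)/, "\\1\n").split("\n")
end

p wrap_words("Fusce urna libero blandit eu aliquet ac rutrum vel tortor.", 50)
# => ["Fusce urna libero blandit eu aliquet ac rutrum vel", "tortor."]
```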

My only idea is to first word-wrap each sentence, then split each resulting line at punctuation, and finally recombine the parts subject to the length limit. I am not sure whether this will work.
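One possible variation on that idea, sketched in plain Ruby: pack whole sentences greedily, and only when a single sentence exceeds the limit, split it at punctuation first and at spaces second. All helper names here are hypothetical, the sentence splitter is a naive regex rather than an NLP library, and the limit is treated as "at most N characters"; none of this has been tested beyond the sample paragraph.

```ruby
# All helper names below are hypothetical; this is a sketch, not a gem API.

# Naive sentence split: break after '.', '!' or '?' followed by whitespace.
def split_sentences(text)
  text.strip.split(/(?<=[.!?])\s+/)
end

# Greedily join parts with single spaces so each result is at most `limit` long.
def pack(parts, limit)
  parts.each_with_object([]) do |part, out|
    if out.any? && out.last.length + 1 + part.length <= limit
      out[-1] = "#{out.last} #{part}"
    else
      out << part
    end
  end
end

# Word wrap at whitespace only (a single word longer than `limit` stays whole).
def wrap_words(text, limit)
  text.gsub(/(.{1,#{limit}})(\s+|\z)/, "\\1\n").split("\n")
end

# Break one oversized sentence: first at ',', ';' or ':', then at spaces,
# and finally recombine neighbouring pieces that still fit together.
def split_long(sentence, limit)
  clauses = sentence.split(/(?<=[,;:])\s+/)
  pieces  = clauses.flat_map { |c| c.length <= limit ? [c] : wrap_words(c, limit) }
  pack(pieces, limit)
end

def chunk_paragraph(text, limit)
  chunks = []
  buffer = [] # whole sentences waiting to be packed together
  split_sentences(text).each do |sentence|
    if sentence.length <= limit
      buffer << sentence
    else
      # Flush the pending sentences, then handle the oversized one on its own.
      chunks.concat(pack(buffer, limit))
      buffer = []
      chunks.concat(split_long(sentence, limit))
    end
  end
  chunks.concat(pack(buffer, limit))
end

text = "Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. " \
       "Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. " \
       "Fusce urna libero blandit eu aliquet ac rutrum vel tortor."
p chunk_paragraph(text, 50)
# => ["Lorem ipsum, consectetur elit. Donec ut ligula.", "Sed acumsan posuere tristique.",
#     "Sed et tristique sem.", "Aenean sollicitudin,", "sapien sodales elementum blandit.",
#     "Fusce urna libero blandit eu aliquet ac rutrum vel", "tortor."]
```

Unlike the expected output in the question, this sketch keeps the comma attached to "Aenean sollicitudin,"; strip trailing punctuation from the pieces if that matters.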


Source: https://habr.com/ru/post/1483996/

