Parsing text into sentences?

I am trying to split the text of a PDF page into sentences, but it is much more complicated than I expected. There are many special cases to consider, such as initials, decimals, and quotes, which contain periods that do not necessarily end the sentence.
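To make those special cases concrete, here is a minimal rule-based sketch in C++ using only the standard library. Everything in it is an illustrative assumption rather than an established method: the names `split_sentences` and `is_abbreviation` are hypothetical, and the abbreviation list is tiny and not exhaustive. It handles decimals, single-letter initials, a few abbreviations, and closing quotes after the final period.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: a tiny illustrative abbreviation list (not exhaustive).
static bool is_abbreviation(const std::string& word) {
    static const std::set<std::string> abbrevs = {"Mr", "Mrs", "Ms", "Dr", "Prof", "etc", "vs"};
    return abbrevs.count(word) > 0;
}

// Split `text` into sentences using a few hand-written rules.
std::vector<std::string> split_sentences(const std::string& text) {
    auto is_digit = [](char ch) { return std::isdigit(static_cast<unsigned char>(ch)) != 0; };
    auto is_alpha = [](char ch) { return std::isalpha(static_cast<unsigned char>(ch)) != 0; };
    auto is_upper = [](char ch) { return std::isupper(static_cast<unsigned char>(ch)) != 0; };
    auto is_space = [](char ch) { return std::isspace(static_cast<unsigned char>(ch)) != 0; };

    std::vector<std::string> sentences;
    const std::size_t n = text.size();
    std::size_t start = 0;  // index where the current sentence begins

    for (std::size_t i = 0; i < n; ++i) {
        const char c = text[i];
        if (c != '.' && c != '!' && c != '?') continue;

        if (c == '.') {
            // Decimal number such as 3.50: a digit on both sides of the period.
            if (i > 0 && i + 1 < n && is_digit(text[i - 1]) && is_digit(text[i + 1])) continue;

            // Collect the alphabetic word immediately before the period.
            std::size_t w = i;
            while (w > start && is_alpha(text[w - 1])) --w;
            const std::string word = text.substr(w, i - w);

            // A single uppercase letter is treated as an initial ("J. K. Rowling").
            if (word.size() == 1 && is_upper(word[0])) continue;
            // Known abbreviation ("Dr.", "etc.").
            if (is_abbreviation(word)) continue;
        }

        // Keep closing quotes or parentheses with the sentence they end.
        std::size_t end = i + 1;
        while (end < n && (text[end] == '"' || text[end] == '\'' || text[end] == ')')) ++end;

        // Only break at the end of the text or before whitespace.
        if (end < n && !is_space(text[end])) continue;

        sentences.push_back(text.substr(start, end - start));

        // Skip the whitespace between sentences.
        while (end < n && is_space(text[end])) ++end;
        start = end;
        i = end - 1;  // the loop increment moves i to `end`
    }

    // Trailing text without a terminator is kept as a final sentence.
    if (start < n) sentences.push_back(text.substr(start));
    return sentences;
}
```

Even this small sketch gets things wrong: a sentence that really ends in an abbreviation or a single capital letter ("Plan B.") is not split, and "e.g." is. That is the kind of case I keep running into.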

Is anyone here familiar with an NLP library for C or C++ that could help me with this task, or can anyone offer some advice?

Thanks for any help.

+3
4 answers

, . Wikipedia , , C.

. Unicode Unicode № 29 - .

+6

(SBD) . , , , C ( , )

, - Unix , Windows , . , SBD , SBD, Z. ,

./pdfconvert | SBD | my_C_tool > ...

, , , , .

, ,

, . OpenNLP , . , , . , , , .

, SBD, . , , . , X, X . , .

, - , .

+3

, , . , . , , , , , PDF , ?

+2

I had the same requirements some time ago and tried several solutions. The best one was splitta ( http://code.google.com/p/splitta/ ). It coped well with all the edge cases that I threw at it. splitta is written in Python.

I also tried sentrick (Java): http://www.denkselbst.de/sentrick/index.html

Unfortunately, I do not have a complete list of all the options that I tried.

0

Source: https://habr.com/ru/post/1710081/

