I am trying to split the text of a PDF page into sentences, but it is much more complicated than I expected. There are many special cases to consider, such as initials, decimals, and quotations, which contain periods that do not necessarily end a sentence.
Is anyone here familiar with an NLP library for C or C++ that could help with this task, or can anyone offer some advice?
Thanks for any help.
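To make the special cases concrete, here is a minimal rule-based sketch in C++. It is not taken from any library: the function name `split_sentences`, the helper names, and the tiny abbreviation list are all hypothetical, chosen only for illustration. It treats a period as non-terminal when it sits between two digits (a decimal), follows a single capital letter (an initial), or follows one of a few listed titles, and it keeps closing quotes attached to the sentence they end.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

namespace {

bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
bool is_upper(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }
bool is_lower(char c) { return std::islower(static_cast<unsigned char>(c)) != 0; }
bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }

// Returns the run of non-space characters that ends just before position `dot`.
std::string word_before(const std::string& text, std::size_t dot) {
    std::size_t start = dot;
    while (start > 0 && !is_space(text[start - 1])) --start;
    return text.substr(start, dot - start);
}

// True when the period at index `i` most likely does not end a sentence.
bool period_is_internal(const std::string& text, std::size_t i) {
    // Decimal number: a digit on both sides, as in "3.50".
    if (i > 0 && i + 1 < text.size() && is_digit(text[i - 1]) && is_digit(text[i + 1]))
        return true;
    const std::string word = word_before(text, i);
    // Initial: a single capital letter, as in "J. Smith".
    if (word.size() == 1 && is_upper(word[0])) return true;
    // Hypothetical, deliberately tiny abbreviation list (an assumption).
    static const std::set<std::string> abbreviations = {"Mr", "Mrs", "Ms", "Dr", "Prof"};
    return abbreviations.count(word) > 0;
}

}  // namespace

std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> sentences;
    std::size_t start = 0;
    while (start < text.size() && is_space(text[start])) ++start;  // skip leading space

    for (std::size_t i = start; i < text.size(); ++i) {
        const char c = text[i];
        if (c != '.' && c != '!' && c != '?') continue;
        if (c == '.' && period_is_internal(text, i)) continue;

        // Keep closing quotes and brackets with the sentence they terminate.
        std::size_t end = i + 1;
        while (end < text.size() && (text[end] == '"' || text[end] == '\'' ||
                                     text[end] == ')' || text[end] == ']'))
            ++end;

        // A boundary must be followed by whitespace or the end of the text.
        if (end < text.size() && !is_space(text[end])) continue;

        std::size_t next = end;
        while (next < text.size() && is_space(text[next])) ++next;

        // A lowercase continuation suggests the sentence has not ended.
        if (next < text.size() && is_lower(text[next])) continue;

        sentences.push_back(text.substr(start, end - start));
        start = next;
        i = next - 1;  // the loop increment moves i to `next`
    }
    // Any unterminated trailing text becomes a final sentence.
    if (start < text.size()) sentences.push_back(text.substr(start));
    return sentences;
}
```

Rules like these break quickly on real text: a sentence that genuinely ends in an initial is not split, and an initial preceded by an opening parenthesis is split wrongly. That is the usual argument for a trained sentence boundary detector over hand-written rules.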
[Most of this answer was garbled in extraction. The surviving terms are: Wikipedia, Unicode Standard Annex #29, sentence boundary detection (SBD), C, Unix, Windows, OpenNLP, and PDF. The one intact line is the following shell pipeline:]
./pdfconvert | SBD | my_C_tool > ...
I had the same requirement some time ago and tried several solutions. The best one was splitta (http://code.google.com/p/splitta/). It coped well with all the edge cases I threw at it. splitta is written in Python.
I also tried sentrick (Java): http://www.denkselbst.de/sentrick/index.html
Unfortunately, I do not have a complete list of all the options I tried.