SMS text normalization

I am looking for a good library or an existing project for normalizing SMS text. I have found some good research projects, like this one.

I use Java as a programming language.

The concept in a nutshell is to take SMS-style text, for example "tel him 2 go home nw", and convert it into plain English text: "tell him to go home now".

+4
3 answers

Why not just download a dictionary from a site such as http://smsdictionary.co.uk/abbreviations and use string replacement?
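
A minimal sketch of that idea in Java, assuming a small hand-written table in place of the downloaded dictionary. The class name `SmsDictionaryNormalizer` and the four entries are made up for illustration; they are not taken from the site above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

class SmsDictionaryNormalizer {
    // Hypothetical entries; a real table would be loaded from a downloaded abbreviation list.
    private static final Map<String, String> DICTIONARY = Map.of(
            "tel", "tell",
            "2", "to",
            "nw", "now",
            "u", "you");

    // Replaces each whitespace-separated token that has a dictionary entry; other tokens pass through.
    static String normalize(String sms) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : sms.trim().split("\\s+")) {
            out.add(DICTIONARY.getOrDefault(token.toLowerCase(Locale.ROOT), token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize(" tel him 2 go home nw "));
    }
}
```

Note that the lookup is per token and ignores the surrounding words, so every "2" becomes "to" no matter where it appears.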

+4

Dictionary substitution doesn't cut it, because it ignores context. For example, do you translate "2" as "to", "too", or "two"?
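
To make the ambiguity concrete, here is a toy Java sketch that picks an expansion for "2" by looking at the next token. The rule, the class name `ContextAwareTwo`, and the tiny verb list are all assumptions made up for illustration. This is a hand-written heuristic, not a statistical model, and it will get many real messages wrong.

```java
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

class ContextAwareTwo {
    // Hypothetical mini-lexicon of verbs that may follow "to".
    private static final Set<String> VERBS = Set.of("go", "come", "see", "be", "do");

    // Toy rule: message-final "2" -> "too", "2" before a known verb -> "to", otherwise "two".
    static String expandTwo(String[] tokens, int i) {
        if (i == tokens.length - 1) {
            return "too";
        }
        if (VERBS.contains(tokens[i + 1].toLowerCase(Locale.ROOT))) {
            return "to";
        }
        return "two";
    }

    static String normalize(String sms) {
        String[] tokens = sms.trim().split("\\s+");
        StringJoiner out = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            out.add(tokens[i].equals("2") ? expandTwo(tokens, i) : tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("tell him 2 go home"));
        System.out.println(normalize("i have 2 cats"));
        System.out.println(normalize("me 2"));
    }
}
```

A plain one-to-one dictionary would give the same word in all three messages.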

You can get a corpus and train a statistical model yourself using Moses (http://www.statmt.org/moses/) or Phrasal (http://nlp.stanford.edu/software/phrasal/).

As the author of the Stanford one (http://www-nlp.stanford.edu/sms/translate.php), I could look into offering a REST-based API for such a service, but I don't know how much demand there is for it ...

+3

Text normalization: the text normalization module organizes the input text into lists of words, i.e. it finds numbers, abbreviations, acronyms, and idiomatic expressions and expands them into full text. This is usually done using regular grammars.
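
As a rough illustration of expansion with regular patterns, here is a Java sketch built on `java.util.regex` (it needs Java 9+ for `Matcher.replaceAll` with a function argument). The class name `RegexTextNormalizer`, the three abbreviations, and the restriction to standalone single digits are assumptions chosen to keep the example short; a real normalizer would need far more rules.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

class RegexTextNormalizer {
    // Hypothetical abbreviation table, keyed by the lower-cased match.
    private static final Map<String, String> ABBREVIATIONS = Map.of(
            "dr.", "doctor",
            "st.", "street",
            "etc.", "et cetera");

    private static final String[] DIGITS = {
            "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"};

    // One pattern per token class: known abbreviations, and digits standing alone.
    private static final Pattern ABBREVIATION =
            Pattern.compile("\\b(?:dr|st|etc)\\.", Pattern.CASE_INSENSITIVE);
    private static final Pattern SINGLE_DIGIT = Pattern.compile("(?<!\\d)\\d(?!\\d)");

    static String normalize(String text) {
        // Expand abbreviations first, then spell out standalone digits.
        String expanded = ABBREVIATION.matcher(text)
                .replaceAll(m -> ABBREVIATIONS.get(m.group().toLowerCase(Locale.ROOT)));
        return SINGLE_DIGIT.matcher(expanded)
                .replaceAll(m -> DIGITS[m.group().charAt(0) - '0']);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Dr. Smith lives at 5 Main St."));
    }
}
```

Multi-digit numbers such as "42" are left untouched by this sketch.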

Word pronunciation: once the sequence of words has been generated by the text normalization module, their pronunciation can be determined. Simple letter-to-sound (LTS) rules can be applied where words are pronounced the way they are written. Where this is not the case, a morphosyntactic analyzer can be used. The morphosyntactic analyzer tags the speech with various identifiers, such as prefixes, roots, and suffixes, and organizes sentences into syntactically related groups of words, such as nouns, verbs, and adjectives. Their pronunciations can then be determined using the lexicon.
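
A very small Java sketch of the lexicon-plus-LTS idea, under heavy assumptions: the class name `PronunciationSketch`, the two lexicon entries, and the one-letter-to-one-sound table are invented for illustration, and the morphosyntactic analysis step is skipped entirely. Checking the lexicon before falling back to letter rules is a simplification chosen here, not something the text above prescribes.

```java
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

class PronunciationSketch {
    // Hypothetical lexicon for words that are not pronounced as written (ARPAbet-style symbols).
    private static final Map<String, String> LEXICON = Map.of(
            "two", "T UW",
            "one", "W AH N");

    // Hypothetical one-letter-to-one-sound rules; real LTS rules depend on context.
    private static final Map<Character, String> LTS = Map.of(
            'b', "B", 'e', "EH", 'd', "D", 't', "T", 'n', "N", 's', "S");

    static String pronounce(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        String fromLexicon = LEXICON.get(key);
        if (fromLexicon != null) {
            return fromLexicon;
        }
        // Fallback: map each letter to a sound; "?" marks letters without a rule.
        StringJoiner sounds = new StringJoiner(" ");
        for (char c : key.toCharArray()) {
            sounds.add(LTS.getOrDefault(c, "?"));
        }
        return sounds.toString();
    }

    public static void main(String[] args) {
        System.out.println(pronounce("two"));
        System.out.println(pronounce("bed"));
    }
}
```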

0

Source: https://habr.com/ru/post/1381087/

