Statistical word segmentation approach

I want to solve the problem of word segmentation (splitting a long string without spaces into words). For example, we want to split somelongword into [some, long, word].

We can achieve this with a vocabulary and some dynamic programming, but another problem we face is resolving ambiguity. That is, orcore can be split as or core or as orc ore (we do not take the meaning of the phrase or parts of speech into account). So I'm thinking of using some kind of statistical or ML approach.
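For concreteness, a unigram model would break the tie by comparing the products of the word probabilities for each reading. A minimal sketch in Python (all the probabilities below are made-up numbers for illustration, not corpus estimates):

```python
# Illustrative unigram probabilities only; a real model would
# estimate these from corpus counts.
p = {"or": 3e-3, "core": 1e-4, "orc": 1e-6, "ore": 2e-5}

p_or_core = p["or"] * p["core"]   # P(or) * P(core)  = 3e-07
p_orc_ore = p["orc"] * p["ore"]   # P(orc) * P(ore)  = 2e-11

print(p_or_core > p_orc_ore)      # True -> "or core" wins
```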

I found that Naive Bayes and the Viterbi algorithm, given a training set, can be used to solve this. Can you point me to some information about applying these algorithms to the word segmentation problem?

UPD: I implemented this method in Clojure, using some advice from Peter Norvig's code.
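The Clojure source isn't shown here, but the Norvig-style idea is a memoized recursion over (first word, rest of string) splits, scored by unigram log-probabilities. A minimal Python sketch under stated assumptions (the toy UNIGRAM table and the 1e-12 unseen-word floor are illustrative; Norvig's actual code estimates probabilities from Google n-gram counts and penalizes unknown words by length):

```python
from functools import lru_cache
from math import log10

# Toy unigram probabilities; a real model estimates them from corpus counts.
UNIGRAM = {"some": 1e-3, "long": 8e-4, "word": 9e-4,
           "or": 3e-3, "core": 1e-4, "orc": 1e-6, "ore": 2e-5}

MAXWORD = 10  # longest candidate word we consider

def logp(word):
    # Unseen words get an arbitrary tiny floor probability.
    return log10(UNIGRAM.get(word, 1e-12))

def splits(text):
    """All (first, rest) pairs with 1 <= len(first) <= MAXWORD."""
    return [(text[:i], text[i:]) for i in range(1, min(len(text), MAXWORD) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    """Most probable segmentation of `text`, as a tuple of words."""
    if not text:
        return ()
    candidates = [(first,) + segment(rest) for first, rest in splits(text)]
    return max(candidates, key=lambda ws: sum(logp(w) for w in ws))

print(segment("somelongword"))  # ('some', 'long', 'word')
print(segment("orcore"))        # ('or', 'core')
```

The memoization makes this equivalent to a Viterbi-style dynamic program over split positions, so it stays fast even for long inputs.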

2 answers

I think Peter Norvig and Sebastian Thrun's slideshow is a good place to start. It presents real work done at Google.


This problem is completely analogous to word segmentation in the many Asian languages that do not explicitly encode word boundaries (e.g., Chinese, Thai). If you want background on approaches to the problem, I would recommend searching Google Scholar for existing approaches to Chinese word segmentation.

You can start by looking at some older approaches: Sproat, Richard and Thomas Emerson. 2003. The First International Chinese Word Segmentation Bakeoff (http://www.sighan.org/bakeoff2003/paper.pdf).

If you want a turnkey solution, I would recommend the LingPipe tutorial (http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html). I have used it on unsegmented English text with good results. I trained a basic character language model on a couple of million words of newswire text, but I suspect that for this task you would get reasonable performance using any corpus of normal English text.

They use a spelling-correction system to recommend candidate "corrections" (where the candidate corrections are identical to the input except that spaces are inserted). The spelling corrector is based on weighted Levenshtein edit distance; they simply disallow substitution and transposition, and restrict valid insertions to a single space.
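A rough sketch of that noisy-channel idea in Python (this is not LingPipe's actual API; the toy training string, the add-alpha smoothing with an assumed 64-character alphabet, and the max_spaces cutoff are all simplifying assumptions for illustration):

```python
from collections import defaultdict
from itertools import combinations
from math import log10

def train_char_bigrams(text):
    """Character-bigram counts from already-spaced training text."""
    counts = defaultdict(lambda: defaultdict(int))
    padded = " " + text + " "
    for a, b in zip(padded, padded[1:]):
        counts[a][b] += 1
    return counts

def score(text, counts, alpha=1.0, charset=64):
    """Add-alpha smoothed character-bigram log10 probability of `text`."""
    padded = " " + text + " "
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts[a]
        total += log10((row[b] + alpha) / (sum(row.values()) + alpha * charset))
    return total

def candidates(text, max_spaces=4):
    """Variants of `text` that differ only by up to `max_spaces` inserted spaces."""
    for k in range(max_spaces + 1):
        for cuts in combinations(range(1, len(text)), k):
            pieces, prev = [], 0
            for cut in cuts:
                pieces.append(text[prev:cut])
                prev = cut
            pieces.append(text[prev:])
            yield " ".join(pieces)

counts = train_char_bigrams("or core some long word")  # stand-in for newswire text
print(max(candidates("orcore"), key=lambda c: score(c, counts)))
```

Here candidates plays the role of the constrained edit model (space insertions only) and the character-bigram score plays the role of the language model; a real system like LingPipe's corrector searches the edit space with the language model rather than enumerating every candidate.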


Source: https://habr.com/ru/post/910484/

