This problem is exactly analogous to word segmentation in many Asian languages that do not explicitly encode word boundaries (e.g., Chinese, Thai). If you want background on approaches to the problem, I'd recommend searching Google Scholar for existing work on Chinese word segmentation.
You can start by looking at some of the older work: Sproat, Richard, and Thomas Emerson. 2003. The First International Chinese Word Segmentation Bakeoff (http://www.sighan.org/bakeoff2003/paper.pdf)
If you want a turnkey solution, I'd recommend the LingPipe tutorial (http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html). I've used it on unsegmented English text with good results. I trained a character language model on a couple million words of newswire text, but I suspect that for this task you'll get reasonable performance using any corpus of relatively normal English text.
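To give a feel for the character-language-model idea, here is a minimal sketch of an add-one-smoothed character bigram model. This is not LingPipe's implementation, just an illustration of the general technique; the training corpus and smoothing constant are made up for the example.

```python
import math
from collections import Counter

def train_char_bigrams(corpus):
    """Count character bigrams and unigrams over the corpus,
    padding with spaces so word boundaries are modeled too."""
    pad = " " + corpus + " "
    return Counter(zip(pad, pad[1:])), Counter(pad)

def log_prob(text, bigrams, unigrams, alpha=1.0):
    """Add-one (Laplace) smoothed character-bigram log-probability.
    Higher is better; used to compare candidate segmentations."""
    pad = " " + text + " "
    vocab = len(unigrams) or 1
    lp = 0.0
    for a, b in zip(pad, pad[1:]):
        lp += math.log((bigrams[(a, b)] + alpha) /
                       (unigrams[a] + alpha * vocab))
    return lp
```

In use, you would score each candidate segmentation (each way of inserting spaces) with `log_prob` and keep the highest-scoring one; text that resembles the training corpus scores much higher than character soup.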
They used a spelling-correction system to recommend candidate "corrections" (where the candidate corrections are identical to the input, but with spaces inserted). Their spelling corrector is based on Levenshtein edit distance; they simply disallow substitution and transposition, and restrict the allowable insertions to a single space.
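The insertion-only idea above can be sketched in a few lines: enumerate every candidate that is identical to the input except for inserted spaces, then score candidates with a word model. The vocabulary counts here are invented for illustration, and the brute-force enumeration (fine for short strings) stands in for a real edit-distance-based corrector.

```python
from itertools import combinations

# Hypothetical unigram counts; a real system would use corpus statistics.
COUNTS = {"in": 50, "sight": 30, "insight": 40,
          "now": 60, "here": 80, "nowhere": 20}
TOTAL = sum(COUNTS.values())

def score(words):
    """Product of unigram probabilities; zero if any word is unknown."""
    p = 1.0
    for w in words:
        p *= COUNTS.get(w, 0) / TOTAL
    return p

def segment(text):
    """Generate every candidate identical to `text` except for inserted
    spaces (the only edit allowed) and return the best-scoring one."""
    best, best_p = [text], score([text])
    n = len(text)
    for k in range(1, n):                      # number of spaces to insert
        for cuts in combinations(range(1, n), k):  # where to insert them
            words = [text[i:j] for i, j in zip((0,) + cuts, cuts + (n,))]
            p = score(words)
            if p > best_p:
                best, best_p = words, p
    return " ".join(best)
```

Note that `segment("herenow")` splits into `"here now"`, while `segment("nowhere")` stays whole because the single word outscores the two-word reading under these counts; that tie-breaking behavior is exactly why the scoring model matters.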