How to tokenize continuous text without space separators?

I am using Python with nltk. I need to process some English text that contains no spaces, but nltk's word_tokenize function cannot handle such input. So how can I tokenize text without any spaces? Are there any tools for this in Python?

2 answers

I do not know of such a tool, but the solution to your problem depends on the language.

For Turkish, you can scan the input text letter by letter and accumulate letters into a word. Once the accumulated letters form a valid word from the dictionary, you save it as a separate token, clear the buffer to start accumulating a new word, and continue the process.
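The accumulation idea above amounts to greedy longest-match segmentation. A minimal sketch, using a toy hard-coded dictionary (an assumption for illustration; a real system would load a full wordlist):

```python
# Toy dictionary (hypothetical; replace with a real wordlist).
DICTIONARY = {"the", "there", "cat", "sat", "on", "mat", "a"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def greedy_tokenize(text):
    """Scan left to right, always taking the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in DICTIONARY:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here; emit the letter on its own.
            tokens.append(text[i])
            i += 1
    return tokens

print(greedy_tokenize("thecatsatonthemat"))
# → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Note that the greedy choice can go wrong exactly in the ambiguous cases described below, where a longer match steals letters that belong to the next word.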

You can try this for English too, but I expect you will run into situations where the end of one word is also the beginning of another dictionary word, and this can cause problems.


Can the Viterbi algorithm help? Not sure... but it is probably better than doing it manually.
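A Viterbi-style dynamic program does resolve the ambiguity: instead of committing greedily, it keeps the best-scoring split of every prefix and picks the most probable segmentation overall. A minimal sketch, assuming toy unigram counts (the numbers are hypothetical, not from a real corpus):

```python
import math

# Hypothetical unigram counts standing in for a real corpus model.
COUNTS = {"the": 50, "there": 10, "a": 40, "cat": 5, "in": 30, "rein": 1}
TOTAL = sum(COUNTS.values())
MAX_LEN = max(len(w) for w in COUNTS)

def viterbi_segment(text):
    """Return the most probable split of `text` into dictionary words."""
    # best[i] = (log-probability, word list) for the best split of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - MAX_LEN), i):
            word = text[j:i]
            if word in COUNTS and best[j][1] is not None:
                score = best[j][0] + math.log(COUNTS[word] / TOTAL)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(text)][1]  # None if no full segmentation exists

print(viterbi_segment("therein"))
# → ['there', 'in']  (beats 'the' + 'rein' because the unigram product is higher)
```

Here "therein" could split as "the" + "rein" or "there" + "in"; the dynamic program compares the log-probabilities of both complete segmentations rather than locking in the first match.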

This answer to another SO question (also highly voted) might help: fooobar.com/questions/141068 / ...


Source: https://habr.com/ru/post/949395/
