How to tokenize continuous text without space separators?

I am using Python with nltk. I need to process some English text that contains no spaces, but nltk's word_tokenize function cannot handle such input. So how can I tokenize text without any spaces? Are there any tools for this in Python?

2 answers

I do not know of such a tool, but the solution to your problem depends on the language.

For Turkish, you can scan the input text letter by letter and accumulate letters into a word. Once the accumulated letters form a valid word from the dictionary, you save it as a separate token, clear the buffer to start accumulating a new word, and continue the process.
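The accumulation idea above amounts to greedy longest-match segmentation. A minimal sketch, using a toy hard-coded dictionary (an assumption for illustration; a real system would load a full wordlist):

```python
# Toy dictionary (hypothetical; replace with a real wordlist).
DICTIONARY = {"the", "there", "cat", "sat", "on", "mat", "a"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def greedy_tokenize(text):
    """Scan left to right, always taking the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in DICTIONARY:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here; emit the letter on its own.
            tokens.append(text[i])
            i += 1
    return tokens

print(greedy_tokenize("thecatsatonthemat"))
# → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Note that the greedy choice can go wrong exactly in the ambiguous cases described below, where a longer match steals letters that belong to the next word.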

You can try this for English too, but I expect you will run into situations where the end of one word is also the beginning of another dictionary word, and this can cause problems.


Can the Viterbi algorithm help? Not sure... but it is probably better than doing it manually.
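A Viterbi-style dynamic program does resolve the ambiguity: instead of committing greedily, it keeps the best-scoring split of every prefix and picks the most probable segmentation overall. A minimal sketch, assuming toy unigram counts (the numbers are hypothetical, not from a real corpus):

```python
import math

# Hypothetical unigram counts standing in for a real corpus model.
COUNTS = {"the": 50, "there": 10, "a": 40, "cat": 5, "in": 30, "rein": 1}
TOTAL = sum(COUNTS.values())
MAX_LEN = max(len(w) for w in COUNTS)

def viterbi_segment(text):
    """Return the most probable split of `text` into dictionary words."""
    # best[i] = (log-probability, word list) for the best split of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - MAX_LEN), i):
            word = text[j:i]
            if word in COUNTS and best[j][1] is not None:
                score = best[j][0] + math.log(COUNTS[word] / TOTAL)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(text)][1]  # None if no full segmentation exists

print(viterbi_segment("therein"))
# → ['there', 'in']  (beats 'the' + 'rein' because the unigram product is higher)
```

Here "therein" could split as "the" + "rein" or "there" + "in"; the dynamic program compares the log-probabilities of both complete segmentations rather than locking in the first match.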

This answer to another SO question (also highly voted) might help: fooobar.com/questions/141068 / ...


Source: https://habr.com/ru/post/949395/
