Minimal approach:
- In your input, identify the words that have lost their spaces and tag them somehow (for example, by prefixing them with a character that does not occur anywhere in the input).
- Get a dictionary of English words, sorted by length, longest first.
- For each tagged word in your input, find the longest dictionary entry that matches from the start and break it off as a word. Repeat on the characters remaining from the original "word" until nothing is left. (If no entry matches, just leave the remainder alone.)
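The greedy longest-match step above can be sketched as follows. This is a minimal illustration, not a complete solution; the function name and the dictionary passed in are my own for the example:

```python
def greedy_segment(text, dictionary):
    """Split `text` by repeatedly breaking off the longest dictionary
    entry that matches at the current position."""
    words = sorted(dictionary, key=len, reverse=True)  # longest first
    out = []
    rest = text
    while rest:
        match = next((w for w in words if rest.startswith(w)), None)
        if match is None:
            out.append(rest)  # no match: leave the remainder alone
            break
        out.append(match)
        rest = rest[len(match):]
    return out
```

For example, `greedy_segment("playforthefunofit", {"play", "for", "forth", "the", "fun", "of", "it"})` returns `["play", "forth", "efunofit"]`, which is exactly the failure mode discussed later in this answer.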
A more sophisticated approach:
Splitting words that lack spaces is a real-world problem in languages that are conventionally written without spaces, such as Chinese and Japanese. I am familiar with Japanese, so I will mainly speak with reference to it.
Typical approaches use a dictionary and a sequence model. The model learns the properties of transitions between labels (for example, part-of-speech tags), and, in combination with the dictionary, is used to estimate the relative probability of the potential word-boundary positions. The most probable segmentation of the whole sentence is then found using (for example) the Viterbi algorithm.
Building such a system yourself is almost certainly overkill if you are just cleaning up OCR data, but if you are interested, it might be worth a look.
An example where the more complex approach succeeds but the simple one fails:

- input: Playforthefunofit
- simple output: Play forth efunofit (forth is longer than for, so greedy matching picks it)
- sophisticated output: Play for the fun of it (forth followed by efunofit is a low-frequency, i.e. unnatural, transition, but for followed by the is not)
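To make the idea concrete, here is a much-simplified sketch of probability-based segmentation. It scores whole words by frequency (a toy unigram model, with made-up counts) rather than the label-transition model described above, but it shows how dynamic programming picks the globally most probable split instead of committing greedily:

```python
import math

def viterbi_segment(text, freq):
    """Return the most probable segmentation of `text` under a unigram
    word model. `freq` maps word -> count (toy stand-in for corpus data)."""
    total = sum(freq.values())

    def logp(w):
        # Log-probability of a word; unknown words are impossible here.
        return math.log(freq[w] / total) if w in freq else float("-inf")

    n = len(text)
    # best[i] = (score, segmentation) for the prefix text[:i]
    best = [(float("-inf"), [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - 8), i):  # cap candidate words at 8 chars
            w = text[j:i]
            score = best[j][0] + logp(w)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [w])
    return best[n][1]  # empty list if no segmentation exists
```

With toy counts such as `{"play": 10, "for": 30, "forth": 1, "the": 50, "fun": 5, "of": 40, "it": 30}`, the input `"playforthefunofit"` comes out as `["play", "for", "the", "fun", "of", "it"]`, because the greedy alternative leaves `efunofit`, which has no valid continuation.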
You can work around this problem with the simple approach, to some extent, by adding common collocations to the dictionary as single units. For example, add forthe to the dictionary as a word, and split it back apart in a post-processing phase.
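The post-processing split might look like this; the collocation table and function name are illustrative, not a prescribed API:

```python
# Collocations that were added to the dictionary as single units,
# mapped to the words they should become after segmentation.
COLLOCATIONS = {"forthe": ["for", "the"]}  # assumed example entry

def post_split(words):
    """Replace any collocation unit with its component words."""
    out = []
    for w in words:
        out.extend(COLLOCATIONS.get(w, [w]))
    return out
```

So a greedy pass that produces `["play", "forthe", "fun", "of", "it"]` becomes `["play", "for", "the", "fun", "of", "it"]` after post-processing.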
Hope this helps - good luck!