Minimal approach:
- In your input, identify the words that have lost their spaces and tag them somehow (for example, by prefixing them with a character that does not occur anywhere in the input).
- Get a dictionary of English words, sorted by length, longest first.
- For each tagged word in your input, find the longest dictionary entry that matches from the start and break it off as a word. Repeat on the characters remaining from the original "word" until nothing is left. (If no entry matches, just leave the remainder alone.)
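The greedy longest-match step above can be sketched as follows. This is a minimal illustration, not a complete solution; the function name and the dictionary passed in are my own for the example:

```python
def greedy_segment(text, dictionary):
    """Split `text` by repeatedly breaking off the longest dictionary
    entry that matches at the current position."""
    words = sorted(dictionary, key=len, reverse=True)  # longest first
    out = []
    rest = text
    while rest:
        match = next((w for w in words if rest.startswith(w)), None)
        if match is None:
            out.append(rest)  # no match: leave the remainder alone
            break
        out.append(match)
        rest = rest[len(match):]
    return out
```

For example, `greedy_segment("playforthefunofit", {"play", "for", "forth", "the", "fun", "of", "it"})` returns `["play", "forth", "efunofit"]`, which is exactly the failure mode discussed later in this answer.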
A more sophisticated approach:
Splitting words that lack spaces is a real-world problem in languages that are conventionally written without spaces, such as Chinese and Japanese. I am familiar with Japanese, so I will mainly speak with reference to it.
Typical approaches use a dictionary and a sequence model. The model learns the properties of transitions between labels (for example, part-of-speech tags), and, in combination with the dictionary, is used to estimate the relative probability of the potential word-boundary positions. The most probable segmentation of the whole sentence is then found using (for example) the Viterbi algorithm.
Building such a system yourself is almost certainly overkill if you are just cleaning up OCR data, but if you are interested, it might be worth a look.
An example where the more complex approach succeeds but the simple one fails:

- input: Playforthefunofit
- simple output: Play forth efunofit (forth is longer than for, so greedy matching picks it)
- sophisticated output: Play for the fun of it (forth followed by efunofit is a low-frequency, i.e. unnatural, transition, but for followed by the is not)
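To make the idea concrete, here is a much-simplified sketch of probability-based segmentation. It scores whole words by frequency (a toy unigram model, with made-up counts) rather than the label-transition model described above, but it shows how dynamic programming picks the globally most probable split instead of committing greedily:

```python
import math

def viterbi_segment(text, freq):
    """Return the most probable segmentation of `text` under a unigram
    word model. `freq` maps word -> count (toy stand-in for corpus data)."""
    total = sum(freq.values())

    def logp(w):
        # Log-probability of a word; unknown words are impossible here.
        return math.log(freq[w] / total) if w in freq else float("-inf")

    n = len(text)
    # best[i] = (score, segmentation) for the prefix text[:i]
    best = [(float("-inf"), [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - 8), i):  # cap candidate words at 8 chars
            w = text[j:i]
            score = best[j][0] + logp(w)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [w])
    return best[n][1]  # empty list if no segmentation exists
```

With toy counts such as `{"play": 10, "for": 30, "forth": 1, "the": 50, "fun": 5, "of": 40, "it": 30}`, the input `"playforthefunofit"` comes out as `["play", "for", "the", "fun", "of", "it"]`, because the greedy alternative leaves `efunofit`, which has no valid continuation.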
You can work around this problem with the simple approach, to some extent, by adding common collocations to the dictionary as single units. For example, add forthe to the dictionary as a word, and split it back apart in a post-processing phase.
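The post-processing split might look like this; the collocation table and function name are illustrative, not a prescribed API:

```python
# Collocations that were added to the dictionary as single units,
# mapped to the words they should become after segmentation.
COLLOCATIONS = {"forthe": ["for", "the"]}  # assumed example entry

def post_split(words):
    """Replace any collocation unit with its component words."""
    out = []
    for w in words:
        out.extend(COLLOCATIONS.get(w, [w]))
    return out
```

So a greedy pass that produces `["play", "forthe", "fun", "of", "it"]` becomes `["play", "for", "the", "fun", "of", "it"]` after post-processing.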
Hope this helps - good luck!