Detection of syllables in a word containing non-alphabetic characters

Question


I am working on a readability test and have implemented a simple syllable-detection algorithm. I detect vowel sequences and count them per word; for example, the word “should” contains one vowel sequence, which is “ou”. Before counting, I remove suffixes like -les, -e, -ed (for example, the word “like” contains one syllable but two vowel sequences, so this method works).
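A minimal sketch of the approach described above, not the asker's actual code. The function name `count_vowel_runs`, the choice to treat "y" as a vowel, the order in which the suffixes are tried, and the floor of one syllable per word are all assumptions made for illustration.

```python
import re

# Suffix list taken from the question; the order they are tried in is an assumption.
SUFFIXES = ("les", "ed", "e")

# Assumption: "y" is counted as a vowel alongside a, e, i, o, u.
VOWEL_RUN = re.compile(r"[aeiouy]+")


def count_vowel_runs(word):
    """Estimate syllables as the number of vowel runs left after suffix stripping."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Strip at most one suffix, and never strip the whole word.
        if w.endswith(suffix) and len(w) > len(suffix):
            w = w[: -len(suffix)]
            break
    # Floor at one so that words such as "the" do not drop to zero.
    return max(1, len(VOWEL_RUN.findall(w)))


if __name__ == "__main__":
    for sample in ("should", "like", "x-ray"):
        print(sample, count_vowel_runs(sample))
```

With this sketch, "should" and "like" each come out as one syllable, but "x-ray" also comes out as one, although it has two.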

But consider these words/sequences:

x-ray (contains two syllables)
I'm (one syllable; maybe I could remove all apostrophes in the text?)
goin'
I'd've
n' (e.g. Pork n' Beans)
3rd (how to treat this?)
12345

What should I do with special characters? Remove them all? That would be fine for most words, but not for "n'" and "x-ray". And how should digits be treated?

These are special cases, but I would be very glad to hear about any experience or ideas on this subject.


nlp readability spell-checking

dfens Oct 16 '10 at 17:29


1 answer

Fred Foo · Accepted Answer · 2010-10-17T10:58:03+0000

I would advise you to first determine how much of your data consists of these kinds of words and how important they are to the overall performance of your program. Also compile some statistics on which kinds occur most often.
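One way to gather such statistics is sketched below. The category names, the helper names `token_kind` and `kind_statistics`, the whitespace tokenisation, and the assumption that apostrophes are plain ASCII `'` are all illustrative choices, not something prescribed by the answer.

```python
from collections import Counter


def token_kind(token):
    """Classify a token by the kind of non-alphabetic content it contains."""
    if token.isalpha():
        return "alphabetic"
    if token.isdigit():
        return "digits"
    if any(ch.isdigit() for ch in token):
        return "mixed digits/letters"
    if "'" in token:  # assumption: ASCII apostrophe only
        return "apostrophe"
    if "-" in token:
        return "hyphen"
    return "other"


def kind_statistics(text):
    """Count how often each kind of token occurs in whitespace-split text."""
    # Trim surrounding punctuation but keep apostrophes and hyphens.
    tokens = (t.strip('.,;:!?"()') for t in text.split())
    return Counter(token_kind(t) for t in tokens if t)


if __name__ == "__main__":
    sample = "I'd've seen the 3rd x-ray of 12345 goin' home."
    for kind, count in kind_statistics(sample).most_common():
        print(kind, count)
```

Running this over a representative corpus would show whether tokens with apostrophes, hyphens, or digits are frequent enough to justify special handling.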

There is no simple correct solution to this problem, but I can offer several heuristics:

A ' between two consonants (shouldn't) seems to indicate the elision of a syllable
A ' (I'd, goin'), , ( , goin' )
, n', ,
(-)

3rd , .
