Detection of syllables in a word containing non-alphabetic characters

I am implementing a readability test and have written a simple syllable detection algorithm. I count the vowel sequences in each word; for example, the word “should” contains one vowel sequence, which is “ou”. Before counting, I remove suffixes like -les, -e, -ed (for example, the word “like” contains one syllable but two vowel sequences, so this method works).
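For reference, the approach described above might look roughly like the following. This is only a sketch of the idea, not the asker's actual code: the function name `count_syllables`, the choice to treat "y" as a vowel, and the rule of stripping at most one suffix are all assumptions.

```python
import re

# Suffixes named in the question, checked longest first (assumed order).
SUFFIXES = ("les", "ed", "e")
# A "vowel sequence" is a maximal run of vowels; counting "y" is an assumption.
VOWEL_RUN = re.compile(r"[aeiouy]+")


def count_syllables(word: str) -> int:
    """Estimate syllables as the number of vowel runs after suffix stripping."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Strip at most one suffix, and never reduce the word to nothing.
        if w.endswith(suffix) and len(w) > len(suffix):
            w = w[: -len(suffix)]
            break
    return len(VOWEL_RUN.findall(w))


print(count_syllables("should"))  # 1 ("ou")
print(count_syllables("like"))    # 1 ("lik" -> "i")
print(count_syllables("x-ray"))   # 1 ("ay"), although the word has two syllables
```

The last call shows the kind of miss this question is about: the letter name "x" is pronounced as a syllable but contains no vowel.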

But consider these words and sequences:

  • x-ray (contains two syllables)
  • I (one syllable; maybe I could simply remove all apostrophes in the text?)
  • goin'
  • I'd've
  • n' (e.g. Pork n' Beans)
  • 3rd (how should this be handled?)
  • 12345

What should I do with special characters? Delete them all? That would be fine for most words, but not for "n'" and "x-ray". And how should digits be treated?

These are special cases, but I would be very glad to hear about any experience or ideas on this matter.

1 answer

I would advise you first to determine how much of your data consists of such words and how much they matter for the overall performance of your program. Also gather some statistics on which kinds are most common.
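A quick way to gather such statistics is to sort tokens into rough categories and count them. The sketch below is illustrative only: the category names, the `classify` and `token_stats` helpers, and the whitespace tokenization are assumptions, and a real corpus may need finer categories.

```python
from collections import Counter

# Punctuation stripped from token edges; apostrophes and hyphens are kept
# on purpose because they are exactly what we want to count.
EDGE_PUNCT = '.,;:!?"()[]'


def classify(token: str) -> str:
    """Put a token into one rough category (hypothetical scheme)."""
    if token.isalpha():
        return "alphabetic"
    if token.isdigit():
        return "digits"          # e.g. 12345
    if any(ch.isdigit() for ch in token):
        return "alphanumeric"    # e.g. 3rd
    if "'" in token or "’" in token:
        return "apostrophe"      # e.g. goin', I'd've
    if "-" in token:
        return "hyphen"          # e.g. x-ray
    return "other"


def token_stats(text: str) -> Counter:
    """Count how many whitespace-separated tokens fall into each category."""
    tokens = (t.strip(EDGE_PUNCT) for t in text.split())
    return Counter(classify(t) for t in tokens if t)


stats = token_stats("I'd've had 3 x-ray scans on the 3rd, goin' home")
for kind, count in stats.most_common():
    print(kind, count)
```

If the special categories turn out to be a tiny share of the tokens, a crude fallback for them may be good enough for a readability score.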

There is no simple correct solution to this problem, but I can offer several heuristics:

  • A ' between two consonants (shouldn't) seems to indicate the omission of a syllable
  • A ' (I'd, goin'), , ( , goin' )
  • , n', ,
  • (-)
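One possible reading of the first heuristic in the list is that an apostrophe with a consonant on both sides marks a dropped vowel that is still pronounced, so it should add one to the vowel-sequence count. That interpretation is an assumption, as are the function name and the decision to treat "y" as a vowel; the sketch omits suffix stripping to stay short.

```python
import re

# Consonant letters; "y" is left out because it is counted as a vowel here.
CONSONANTS = "bcdfghjklmnpqrstvwxz"
# A straight or curly apostrophe with a consonant on each side.
FLANKED_APOSTROPHE = re.compile(rf"(?<=[{CONSONANTS}])['’](?=[{CONSONANTS}])")
VOWEL_RUN = re.compile(r"[aeiouy]+")


def count_with_apostrophes(word: str) -> int:
    """Vowel runs plus one for each consonant-flanked apostrophe (assumed rule)."""
    w = word.lower()
    return len(VOWEL_RUN.findall(w)) + len(FLANKED_APOSTROPHE.findall(w))


print(count_with_apostrophes("shouldn't"))  # 2: "ou" plus the n't apostrophe
print(count_with_apostrophes("I'd"))        # 1: apostrophe follows a vowel
print(count_with_apostrophes("goin'"))      # 1: trailing apostrophe is ignored
```

Plain vowel-run counting still merges "oi" in goin' into a single sequence, so this rule alone does not cover every case in the question.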

3rd , .


Source: https://habr.com/ru/post/1769884/

