I am implementing a readability test and have written a simple syllable detection algorithm. I count the vowel sequences in each word; for example, the word “should” contains one vowel sequence, which is “ou”. Before counting, I remove suffixes like -les, -e, -ed (for example, the word “like” contains one syllable but two vowel sequences, so removing the -e gives the right count).
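A minimal sketch of this approach, in Python (the post does not name a language). The function name `count_vowel_sequences`, the choice to treat "y" as a vowel, the longest-first suffix order, and the rule that a suffix is only stripped when a vowel remains are all assumptions made for illustration, not part of the original algorithm.

```python
import re

# Suffixes taken from the post; trying the longest first is an assumption.
SUFFIXES = ("les", "ed", "e")

# Treating "y" as a vowel is an assumption.
VOWEL_RUN = re.compile(r"[aeiouy]+")


def count_vowel_sequences(word: str) -> int:
    """Estimate syllables as the number of vowel runs after suffix removal."""
    w = word.lower()
    for suffix in SUFFIXES:
        stem = w[: -len(suffix)]
        # Assumption: strip only if the remaining stem still has a vowel,
        # so short words like "the" or "red" are left untouched.
        if w.endswith(suffix) and VOWEL_RUN.search(stem):
            w = stem
            break
    return len(VOWEL_RUN.findall(w))
```

With these assumptions, `count_vowel_sequences("should")` and `count_vowel_sequences("like")` both return 1, and `count_vowel_sequences("x-ray")` also returns 1, although "x-ray" has two syllables.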
But consider these words and sequences:
- x-ray (contains two syllables)
- I'm (one syllable; maybe I can remove all apostrophes from the text?)
- goin'
- I'd've
- n' (e.g. Pork n' Beans)
- 3rd (how should this be handled?)
- 12345
What should I do with special characters? Delete them all? That would be fine for most words, but not for "n'" and "x-ray". And how should digits be treated?
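To make the problem concrete, here is a small Python sketch of the "delete them all" option applied to some of the tokens above. The helper names `strip_non_letters` and `vowel_runs` are hypothetical, and counting "y" as a vowel is again an assumption.

```python
import re


def strip_non_letters(token: str) -> str:
    # Remove everything that is not an ASCII letter:
    # digits, hyphens, apostrophes and other special characters.
    return re.sub(r"[^A-Za-z]", "", token)


def vowel_runs(token: str) -> int:
    # Count maximal runs of vowels ("y" included, as an assumption).
    return len(re.findall(r"[aeiouy]+", token.lower()))


for token in ["x-ray", "n'", "3rd", "12345"]:
    cleaned = strip_non_letters(token)
    print(f"{token} -> {cleaned!r}: {vowel_runs(cleaned)} vowel sequence(s)")
```

Under these assumptions "x-ray" becomes "xray" with a single vowel sequence, while "n'", "3rd" and "12345" end up with none, so blanket deletion undercounts exactly these cases.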
These are special cases, but I would be very glad to hear about any experience or ideas on this matter.