How to count syllables in text using regular expressions in Java

Question


I have text as a String and need to count the number of syllables in each word. I tried splitting the text into an array of words and processing each word separately, using regular expressions. But the syllable pattern is not working properly. Please advise how to change it so that it counts the syllables correctly. My source code:

    public int getNumSyllables() {
        String[] words = getText().toLowerCase().split("[a-zA-Z]+");
        int count = 0;
        List<String> tokens = new ArrayList<String>();
        for (String word : words) {
            tokens = Arrays.asList(word.split("[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*"));
            count += tokens.size();
        }
        return count;
    }

+5

java string arrays regex

burzakovskiy Oct 29 '15 at 9:37


8 answers

This gives you the number of vowel groups in a word:

    public int getNumVowels(String word) {
        String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
        Pattern p = Pattern.compile(regexp);
        Matcher m = p.matcher(word.toLowerCase());
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

You can call it on every word in your string array:

    String[] words = getText().split("\\s+");
    for (String word : words) {
        System.out.println("Word: " + word + ", vowels: " + getNumVowels(word));
    }

Update: as freerunner noted, counting syllables is harder than just counting vowels. Combinations such as ou, ui and oo, the final silent e, and possibly other cases need to be considered. Since I am not a native speaker of English, I am not sure what the correct algorithm would be.
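To make that limitation concrete, here is a minimal, self-contained sketch that applies the same pattern to a few words. The class name VowelGroupDemo and the method name countVowelGroups are made up for this illustration; only the pattern comes from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VowelGroupDemo {
    // Same pattern as in getNumVowels: optional consonants, one or more vowels, optional consonants.
    private static final Pattern GROUP =
            Pattern.compile("[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*");

    // Counts how many times the pattern matches, i.e. the number of vowel groups.
    static int countVowelGroups(String word) {
        Matcher m = GROUP.matcher(word.toLowerCase());
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"book", "late", "audio"}) {
            System.out.println(w + ": " + countVowelGroups(w));
        }
    }
}
```

With this pattern "book" gives 1, which is right, but "late" gives 2 because the silent e forms its own group, and "audio" gives 2 because "io" is treated as a single group.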

+2

user5500105 Oct 29 '15 at 21:57


Using the concept of user5500105, I developed the following method for calculating the number of syllables in a word. Rules:

Consecutive vowels are considered 1 syllable, e.g. "ae", "ou" - 1 syllable.
Y is considered a vowel.
An e at the end is counted as a syllable only if it is the only vowel: for example, "the" is one syllable, since the "e" at the end is the only vowel, and "there" is also 1 syllable, because the "e" is at the end and there is another vowel in the word.

    public int countSyllables(String word) {
        ArrayList<String> tokens = new ArrayList<String>();
        String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
        Pattern p = Pattern.compile(regexp);
        Matcher m = p.matcher(word.toLowerCase());
        while (m.find()) {
            tokens.add(m.group());
        }
        // check whether e is the last token and is not the only vowel
        if (tokens.size() > 1 && tokens.get(tokens.size() - 1).equals("e")) {
            // e is last and not the only vowel, so total syllables - 1
            return tokens.size() - 1;
        }
        return tokens.size();
    }

+2

freerunner Dec 7 '15 at 20:51


This is how I do it. It is about the simplest algorithm I could come up with.

    public static int syllables(String s) {
        final Pattern p = Pattern.compile("([ayeiou]+)");
        final String lowerCase = s.toLowerCase();
        final Matcher m = p.matcher(lowerCase);
        int count = 0;
        while (m.find())
            count++;
        // treat a trailing e as silent
        if (lowerCase.endsWith("e"))
            count--;
        // never report fewer than one syllable (e.g. "the" would otherwise give 0)
        return count < 1 ? 1 : count;
    }

I use this in conjunction with the soundex function to determine if words sound the same. The syllable code increases the accuracy of my soundex function.

Note. This is strictly for counting syllables in a word. I assume that you can parse your input for words using something like java.util.StringTokenizer .
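As a rough sketch of that last step, the text can be split into words with java.util.StringTokenizer and a per-word counter applied to each token. The names TokenizeDemo, vowelGroups and countInText, and the delimiter set, are assumptions made for this example; vowelGroups is only a placeholder that counts runs of vowels, so substitute whichever per-word method you actually use.

```java
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizeDemo {
    private static final Pattern VOWELS = Pattern.compile("[ayeiou]+");

    // Placeholder per-word counter (hypothetical): counts runs of vowels only.
    static int vowelGroups(String word) {
        Matcher m = VOWELS.matcher(word.toLowerCase());
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Splits the text on whitespace and common punctuation, then sums the per-word counts.
    static int countInText(String text) {
        StringTokenizer st = new StringTokenizer(text, " \t\n\r\f.,;:!?");
        int total = 0;
        while (st.hasMoreTokens()) {
            total += vowelGroups(st.nextToken());
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countInText("Hello, big world!"));
    }
}
```

Passing the punctuation characters as delimiters keeps commas and exclamation marks out of the tokens, so each token handed to the counter is a bare word.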

+2

Armand May 18, '16 at 22:17


Your line

 String[] words = getText().toLowerCase().split("[a-zA-Z]+");

splits ON the words and returns only the spaces between them! You want to split on the whitespace between words, like this:

 String[] words = getText().toLowerCase().split("\\s+");
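The difference is easy to see by printing both results for a short sample. This is only an illustrative sketch: the class name SplitDemo, the two helper names and the sample sentence are made up here, and getText() is replaced by a plain parameter.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on runs of letters, so the array holds only what lies between the words.
    static String[] splitOnLetters(String text) {
        return text.toLowerCase().split("[a-zA-Z]+");
    }

    // Splits on runs of whitespace, so the array holds the words themselves.
    // trim() avoids an empty first element when the text starts with whitespace.
    static String[] splitOnWhitespace(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Hello big world";
        System.out.println(Arrays.toString(splitOnLetters(text)));
        System.out.println(Arrays.toString(splitOnWhitespace(text)));
    }
}
```

For this sample the first call returns an empty string followed by two single spaces, while the second returns the three lowercase words.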

0

Nickj Oct 29 '15 at 21:46


You can do it like this:

    public int getNumSyllables() {
        return getSyllables(getTokens("[a-zA-Z]+"));
    }

    protected List<String> getWordTokens(String word, String pattern) {
        ArrayList<String> tokens = new ArrayList<String>();
        Pattern tokSplitter = Pattern.compile(pattern);
        Matcher m = tokSplitter.matcher(word);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    private int getSyllables(List<String> tokens) {
        int count = 0;
        for (String word : tokens) {
            if (word.toLowerCase().endsWith("e")
                    && getWordTokens(word.toLowerCase().substring(0, word.length() - 1), "[aeiouy]+").size() > 0) {
                count += getWordTokens(word.toLowerCase().substring(0, word.length() - 1), "[aeiouy]+").size();
            } else {
                count += getWordTokens(word.toLowerCase(), "[aeiouy]+").size();
            }
        }
        return count;
    }

0

Mohammed Saad Mostafa Apr 16 '16 at 3:18


I count "the" separately, then split the text at words that end in e.
Then I count the syllables. Here is my implementation:

    int syllables = 0;
    word = word.toLowerCase();
    if (word.contains("the ")) {
        syllables++;
    }
    String[] split = word.split("e!$|e[?]$|e,|e |e[),]|e$");
    ArrayList<String> tokens = new ArrayList<String>();
    Pattern tokSplitter = Pattern.compile("[aeiouy]+");
    for (int i = 0; i < split.length; i++) {
        String s = split[i];
        Matcher m = tokSplitter.matcher(s);
        while (m.find()) {
            tokens.add(m.group());
        }
    }
    syllables += tokens.size();

I tested all test cases.

0

Ima miri May 27 '16 at 4:31


You are using the split method incorrectly. Its argument is the delimiter to split on, not the pattern of the words you want to keep. You need to write something like this:

 String[] words = getText().toLowerCase().split(" ");

But if you want to count the number of syllables, just count the number of vowels:

    String input = "text";
    Set<Character> vowel = new HashSet<>();
    vowel.add('a');
    vowel.add('e');
    vowel.add('i');
    vowel.add('o');
    vowel.add('u');
    int count = 0;
    for (char c : input.toLowerCase().toCharArray()) {
        if (vowel.contains(c)) {
            count++;
        }
    }
    System.out.println("count = " + count);

-1

user2224429 Oct 29 '15 at 10:01


Anthonyeef · Accepted Answer · 2015-12-28T09:26:22+0000

This question is related to the Java UCSD course, am I right?

I think you should include enough information in the question so that it does not confuse people who want to help. Here is my own solution, which has already passed both the local test cases and the OJ from UCSD.

You have left out important information about how a syllable is defined in this question. In fact, I believe the key to this problem is how you deal with e. For example, take the combination te. If te appears in the middle of a word, it should of course be counted as a syllable; however, if it is at the end of the word, the e should be treated as a silent e, as in English, and should not be counted as a syllable.

That's it. Here is my idea written as pseudocode:

    if (last character is e) {
        if (it is a silent e at the end of this word) {
            remove the silent e;
            count the rest as regular;
        } else {
            count++;
        }
    } else {
        count it as regular;
    }

You may notice that I am not using only regex to solve this problem. I did think about it: is it really possible to solve this problem with a regular expression alone? My answer is no, I don't think so. At least right now, with the knowledge that UCSD gives us, it is too hard to do. Regex is a powerful tool that can match the characters you want very quickly. However, a plain pattern does not know where in the word it is matching. Take te as an example again: the regular expression cannot treat the two occurrences differently in a word like teate (I made this word up). If our regex pattern counts the first te as a syllable, then why wouldn't it count the last te?
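The teate example can be checked with a small sketch. The names TeateDemo, countGroups and countGroupsIgnoringFinalE are invented for this illustration, and the pattern is a plain vowel-group pattern assumed for the demonstration rather than a definitive syllable rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TeateDemo {
    // Position-blind pattern: optional non-vowels followed by one or more vowels.
    private static final Pattern GROUP = Pattern.compile("[^aeiouy]*[aeiouy]+");

    // Counts every match, wherever in the word it occurs.
    static int countGroups(String word) {
        Matcher m = GROUP.matcher(word);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Drops one trailing 'e' before counting, i.e. treats it as silent.
    static int countGroupsIgnoringFinalE(String word) {
        String trimmed = word.endsWith("e") ? word.substring(0, word.length() - 1) : word;
        return countGroups(trimmed);
    }

    public static void main(String[] args) {
        System.out.println(countGroups("teate"));
        System.out.println(countGroupsIgnoringFinalE("teate"));
    }
}
```

The pattern on its own counts both the leading "tea" and the trailing "te", giving 2, while stripping the final e in ordinary Java code first gives 1. Note that blindly stripping the e would turn "the" into 0, which is why a separate check for whether the e is really silent is still needed.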

Meanwhile, UCSD actually talked about this in the assignment document:

If you find yourself doing mental gymnastics in order to create one regular expression for directly counting syllables, this usually indicates that there is a simpler solution (hint: consider a loop over characters - see the next hint below). Just because a piece of code (such as a regular expression) is shorter does not mean that it is always better.

The hint here is that you should approach this problem with a loop, possibly combined with regex matching.
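One possible reading of that hint, as a rough sketch and not the course's reference solution, is a single pass over the characters that starts a new syllable wherever a vowel follows a non-vowel and skips a lone final e unless it is the only vowel group. The class and method names below are made up for the example, and treating y as a vowel is an assumption carried over from the other answers.

```java
class LoopSyllableCounter {
    // y is treated as a vowel here (an assumption, matching the patterns used elsewhere in this thread).
    static boolean isVowel(char c) {
        return "aeiouy".indexOf(c) >= 0;
    }

    static int countSyllables(String word) {
        String w = word.toLowerCase();
        int count = 0;
        boolean prevVowel = false;
        for (int i = 0; i < w.length(); i++) {
            char c = w.charAt(i);
            boolean vowel = isVowel(c);
            // A new vowel group starts where a vowel follows a non-vowel or the start of the word.
            if (vowel && !prevVowel) {
                boolean loneFinalE = (i == w.length() - 1) && c == 'e';
                // A lone final e is skipped as silent unless no other vowel group was counted.
                if (!(loneFinalE && count > 0)) {
                    count++;
                }
            }
            prevVowel = vowel;
        }
        return count;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"the", "ate", "there", "contiguous"}) {
            System.out.println(w + ": " + countSyllables(w));
        }
    }
}
```

Because the loop knows the index of each character, the end-of-word case needs no separate regex; the trade-off is that every rule has to be spelled out by hand.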

OK, I should finally show my code:

    protected int countSyllables(String word) {
        // TODO: Implement this method so that you can call it from the
        // getNumSyllables method in BasicDocument (module 1) and
        // EfficientDocument (module 2).
        int count = 0;
        word = word.toLowerCase();
        if (word.charAt(word.length() - 1) == 'e') {
            if (silente(word)) {
                String newword = word.substring(0, word.length() - 1);
                count = count + countit(newword);
            } else {
                count++;
            }
        } else {
            count = count + countit(word);
        }
        return count;
    }

    private int countit(String word) {
        int count = 0;
        Pattern splitter = Pattern.compile("[^aeiouy]*[aeiouy]+");
        Matcher m = splitter.matcher(word);
        while (m.find()) {
            count++;
        }
        return count;
    }

    private boolean silente(String word) {
        word = word.substring(0, word.length() - 1);
        Pattern yup = Pattern.compile("[aeiouy]");
        Matcher m = yup.matcher(word);
        if (m.find()) {
            return true;
        } else
            return false;
    }

You may notice that, in addition to countSyllables, I also created two helper methods, countit and silente. countit counts the syllables inside a word, and silente checks whether the word ends with a silent e. Note also what counts as a non-silent e: for example, the e in "the" is not silent, while the e in "ate" is.

My code has passed the tests, both the local test cases and the OJ from UCSD.

PS: It should be fine to use something like [^aeiouy] directly, because the text is already split into words before this method is called. It is also necessary to convert the word to lowercase, which saves a lot of work dealing with capital letters. We only need the number of syllables. Speaking of the number, a more elegant way might be to define count as static, so the private methods can use count++ directly. But for now this works.

Feel free to contact me if you still have not found a solution to this problem :)
