You can do this with the steps described below. First I describe an algorithm, then I give some (unverified and possibly broken) Java code.
Note: I will use the Apache Commons Codec library.
Algorithm:
1. Use a regular expression to represent the input pattern.
2. From the vocabulary of "valid known words", select the subset that matches your regular expression. Call this subset MS.
3. Use the Double Metaphone algorithm to encode the words in MS.
4. Apply some phonetic filtering to reduce MS to what you need.
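Steps 1 and 2 need nothing beyond `java.util.regex`. Here is a minimal sketch of them; the class name, the tiny lexicon, and the pattern `(?i)c[a-z]*t[a-z]*` are placeholders I made up for illustration, so substitute your own word list and pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class LexiconFilterSketch {

    // Steps 1 and 2: compile the input pattern once, then keep only the
    // lexicon words that match it in full. The returned list is MS.
    static List<String> matchingWords(String inputPattern, List<String> lexicon) {
        Pattern pattern = Pattern.compile(inputPattern);
        List<String> ms = new ArrayList<>();
        for (String word : lexicon) {
            if (pattern.matcher(word).matches()) {
                ms.add(word);
            }
        }
        return ms;
    }

    public static void main(String[] args) {
        // Hypothetical lexicon and pattern, chosen only for illustration.
        List<String> lexicon = Arrays.asList("Cute", "Cat", "Cut", "Caught", "City", "Dog");
        String inputPattern = "(?i)c[a-z]*t[a-z]*";
        System.out.println(matchingWords(inputPattern, lexicon));
        // prints [Cute, Cat, Cut, Caught, City]
    }
}
```

Note that `matches()` requires the whole word to match the pattern; `find()` would also accept words that merely contain a match.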
To illustrate how steps 3 and 4 work, I will first show the result of the Double Metaphone algorithm on the five words you suggested as examples: Cute, Cat, Cut, Caught, City.
Code A (illustrating a double metaphone):
    // Requires: import org.apache.commons.codec.language.DoubleMetaphone;
    private static void doubleMetaphoneTest() {
        DoubleMetaphone dm = new DoubleMetaphone();
        // Print each example word next to its primary Double Metaphone code.
        System.out.println("Cute\t" + dm.encode("Cute"));
        System.out.println("Cat\t" + dm.encode("Cat"));
        System.out.println("Cut\t" + dm.encode("Cut"));
        System.out.println("Caught\t" + dm.encode("Caught"));
        System.out.println("City\t" + dm.encode("City"));
    }
Code A output:
    Cute    KT
    Cat     KT
    Cut     KT
    Caught  KFT
    City    ST
In your question, you stated that City is not a valid solution because it starts with an S sound. Double Metaphone makes this visible: City encodes to ST, while the other four words encode to something starting with K (although I'm sure there will be cases where it does not help). You can now apply step 4 of the algorithm using this principle.
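As a quick illustration of that principle, here is a sketch that runs without commons-codec. It assumes you want the "K" sound, and it uses a lookup table holding the codes printed by Code A as a stand-in for the real encoder. The class and method names are made up for this example; with the library on the classpath you would pass the `encode` method of a `DoubleMetaphone` instance instead of the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class PhoneticPrefixSketch {

    // Step 4: keep the words whose phonetic code starts with the wanted sound.
    // The encoder is passed in so that any word-to-code function can be used.
    static List<String> keepWordsStartingWith(char sound, List<String> ms,
                                              Function<String, String> encoder) {
        List<String> kept = new ArrayList<>();
        for (String word : ms) {
            String code = encoder.apply(word);
            // Skip words the encoder has no code for.
            if (code != null && !code.isEmpty() && code.charAt(0) == sound) {
                kept.add(word);
            }
        }
        return kept;
    }

    // Stand-in encoder (an assumption for this sketch): the codes shown in
    // the Code A output, stored in a map instead of computed by the library.
    static Map<String, String> codeAOutput() {
        Map<String, String> codes = new HashMap<>();
        codes.put("Cute", "KT");
        codes.put("Cat", "KT");
        codes.put("Cut", "KT");
        codes.put("Caught", "KFT");
        codes.put("City", "ST");
        return codes;
    }

    public static void main(String[] args) {
        List<String> ms = Arrays.asList("Cute", "Cat", "Cut", "Caught", "City");
        Map<String, String> codes = codeAOutput();
        System.out.println(keepWordsStartingWith('K', ms, codes::get));
        // prints [Cute, Cat, Cut, Caught]
    }
}
```

Only the first character of the code is compared, so City (ST) is dropped while Caught (KFT) is kept.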
In the following code for step 4 (phonetic filtering), I assume you already know that you want the "K" sound and not the "S" sound.
Code B (prototype solution for the whole question):
Note: this code is intended to illustrate the use of the Double Metaphone algorithm for your purpose. I have not run it. The regular expression may be broken or very crude, or my use of Pattern and Matcher may be wrong (it is 2 AM here). If it is wrong, please improve or correct it.
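    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.commons.codec.language.DoubleMetaphone;

    public class GenerateWords {

        // Step 2: return the words of the lexicon that fully match the input pattern.
        public static List<String> fetchMatchingWordsFromLexicon(String inputPattern, List<String> lexicon) {
            Pattern p = Pattern.compile(inputPattern);
            List<String> result = new ArrayList<String>();
            for (String aWord : lexicon) {
                Matcher m = p.matcher(aWord);
                if (m.matches()) {
                    result.add(aWord);
                }
            }
            return result;
        }

        // Steps 3 and 4: encode each word and keep those whose code starts with the given prefix.
        public static List<String> filterWordsBeginningWithMetaphonePrefix(char prefix, List<String> possibleWords) {
            List<String> result = new ArrayList<String>();
            DoubleMetaphone dm = new DoubleMetaphone();
            for (String aWord : possibleWords) {
                String phoneticRepresentation = dm.encode(aWord);
                // The original listing breaks off after the line above. The rest of this
                // method is an assumed completion inferred from the method name.
                if (phoneticRepresentation != null && !phoneticRepresentation.isEmpty()
                        && phoneticRepresentation.charAt(0) == prefix) {
                    result.add(aWord);
                }
            }
            return result;
        }
    }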