Create a list of English words containing consecutive consonant sounds

Question

Start with this:

[G|C] * [T] *

Write a program that generates this:

    Cat
    Cut
    Cute
    City   <-- NOTE: this one is wrong, because City has an "ESS" sound at the start.
    Caught
    ...
    Gate
    Gotti
    Gut
    ...
    Kit
    Kite
    Kate
    Kata
    Katie

Another example:

[C] * [T] * [N]

The following should occur:

Cotton Kitten

Where should I start my research to figure out how to write a program or script that does this?

algorithm nlp

dreftymac Feb 18 '10 at 22:21

6 answers

Rich · Answer 1 · 2010-02-18T22:25:59+0000

You can do this by running regular expressions against a dictionary that contains phonetic transcriptions of the words.

Here is an example in Javascript:

    <html>
    <head>
    <title>Test</title>
    <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script>
    <script>
    // Load the pronunciation dictionary (one "WORD  PHONEMES" entry per line)
    $.get('cmudict0.3', function (data) {
        // Keep entries whose pronunciation starts with K, contains T and ends with N
        var matches = data.match(/^(\S*)\s+K.*\sT.*\sN$/mg) || [];
        $('body').html('<p>' + matches.join('<br/> ') + '</p>');
    });
    </script>
    </head>
    <body>
    </body>
    </html>

You will need to download the word list from http://icon.shef.ac.uk/Moby/mpron.tar.Z and place it (uncompressed) in the same folder as the HTML file. I only translated the [C] * [T] * [N] pattern into a regular expression, and the output is not very pretty, but it should give you the idea. Here is a sample of the output:

    CALTON  K AE1 L T AH0 N
    CAMPTON  K AE1 M P T AH0 N
    CANTEEN  K AE0 N T IY1 N
    CANTIN  K AA0 N T IY1 N
    CANTLIN  K AE1 N T L IH0 N
    CANTLON  K AE1 N T L AH0 N
    ...
    COTTERMAN  K AA1 T ER0 M AH0 N
    COTTMAN  K AA1 T M AH0 N
    COTTON  K AA1 T AH0 N
    COTTON(2)  K AO1 T AH0 N
    COULSTON  K AW1 L S T AH0 N
    COUNTDOWN  K AW1 N T D AW2 N
    ...
    KITSON  K IH1 T S AH0 N
    KITTELSON  K IH1 T IH0 L S AH0 N
    KITTEN  K IH1 T AH0 N
    KITTERMAN  K IH1 T ER0 M AH0 N
    KITTLESON  K IH1 T L IH0 S AH0 N
    ...
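The hand-written regular expression above can also be generated from the bracket notation used in the question. The following is only a minimal sketch under some assumptions: the names `LETTER_TO_PHONEME`, `patternToRegex` and `matchWords` are made up for illustration, the letter-to-sound table is a guess that covers just the letters used in the question, tokens in the pattern are separated by spaces, and `*` is taken to mean "zero or more vowel sounds" (in this dictionary format the vowels are the symbols that end in a stress digit). That makes it stricter than the regular expression above, which allows any sounds in between.

```javascript
// Hypothetical mapping from the question's letters to dictionary consonant symbols.
const LETTER_TO_PHONEME = { C: 'K', K: 'K', G: 'G', T: 'T', N: 'N' };

// Turn a pattern such as "[G|C] * [T] *" into a RegExp that is applied to a
// phoneme string in which every phoneme is preceded by one space, e.g. " K AE1 T".
function patternToRegex(pattern) {
  const parts = pattern.trim().split(/\s+/).map(function (token) {
    if (token === '*') {
      // Zero or more vowel phonemes (vowels carry a stress digit 0-2).
      return '(?: [A-Z]+[0-2])*';
    }
    // "[G|C]" -> ["G", "C"] -> " (?:G|K)"
    const letters = token.replace(/^\[|\]$/g, '').split('|');
    const phonemes = letters.map(function (letter) {
      return LETTER_TO_PHONEME[letter] || letter;
    });
    return ' (?:' + phonemes.join('|') + ')';
  });
  return new RegExp('^' + parts.join('') + '$');
}

// Return the words whose pronunciation matches the pattern.
// Each dictionary line is expected to look like "COTTON  K AA1 T AH0 N".
function matchWords(pattern, dictionaryText) {
  const regex = patternToRegex(pattern);
  const words = [];
  dictionaryText.split('\n').forEach(function (line) {
    const fields = line.trim().split(/\s+/);
    if (fields.length < 2) return; // skip blank or malformed lines
    const phonemes = ' ' + fields.slice(1).join(' ');
    if (regex.test(phonemes)) words.push(fields[0]);
  });
  return words;
}
```

Under these assumptions, `matchWords('[C] * [T] * [N]', data)` would keep entries such as COTTON and KITTEN but drop CANTEEN, because the N before the T is a consonant rather than a vowel.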

Justin peel · Answer 2 · 2010-02-18T22:53:52+0000

You need a word list or dictionary that uses something like the International Phonetic Alphabet or some other standard phonetic way of writing words. It has to contain English words together with their corresponding phonetic spellings. I have no idea where you would get one, because I do not think the standard dictionary publishers simply give that kind of information away.

ablerman · Answer 3 · 2010-02-18T23:10:10+0000

You need the Moby Pronunciator. It is part of the Moby word project.

Here you will find an explanation and links to the files: http://en.wikipedia.org/wiki/Moby_Project

The Moby Pronunciator is a list of approximately 170,000 words and their phonetic pronunciations.

From there, writing the program should be relatively straightforward.

ferdystschenko · Answer 4 · 2010-02-18T22:30:25+0000

One approach would be to convert an English pronunciation dictionary into a finite-state machine and then search it using a regular expression or a simple pattern. You could also compile such a dictionary yourself by running a list of English words through a program that produces phonetic transcriptions, such as the ones offered on some websites.

Finding a mechanism to go back from phonetic transcription to standard spelling should be easy.
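As a rough illustration of both points, a pronunciation dictionary can be stored as a trie keyed by phonemes, which is a simple deterministic finite-state acceptor, with the spellings kept at the accepting states so that going back from a transcription to the spelling is a plain lookup. This is only a sketch: the names `buildPhonemeTrie` and `spellingsFor` are made up, and the entry format (a spelling plus an array of phoneme symbols) is an assumption.

```javascript
// Build a phoneme trie. Each node is a state; each phoneme is a transition.
// entries: array of [spelling, phonemeArray] pairs.
function buildPhonemeTrie(entries) {
  const root = { next: new Map(), words: [] };
  entries.forEach(function (entry) {
    let node = root;
    entry[1].forEach(function (phoneme) {
      if (!node.next.has(phoneme)) {
        node.next.set(phoneme, { next: new Map(), words: [] });
      }
      node = node.next.get(phoneme);
    });
    // The state reached after the last phoneme remembers the spelling(s).
    node.words.push(entry[0]);
  });
  return root;
}

// Follow one transition per phoneme and return the spellings stored
// at the final state, or an empty array if the sequence is unknown.
function spellingsFor(trie, phonemes) {
  let node = trie;
  for (const phoneme of phonemes) {
    node = node.next.get(phoneme);
    if (!node) return [];
  }
  return node.words;
}
```

A pattern search would then walk the trie instead of scanning every line, following only the transitions that the pattern allows at each position.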

danben · Answer 5 · 2010-02-18T22:46:31+0000

A phoneme is "the smallest unit of sound used to create meaningful contrasts between utterances." As I understand it, phonemes are the basis of pronunciation-based spelling correction systems. A misspelling of "newspaper" such as "noospaypr" can still produce the right correction, despite the large edit distance between the two words, because the corresponding segments in each word (oo and ew, pa and pay, per and pr) can be converted into the same phonemes.

Unfortunately, a couple of minutes of Googling did not turn up any libraries that perform this conversion for English words, but that is where I would start.
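To make the segment idea concrete, here is a toy sketch that converts a spelling into phoneme symbols by greedy longest-match over a lookup table. Everything in it is an assumption made for illustration: the table `SEGMENT_TO_PHONEMES` is hand-written and covers only the segments mentioned above, the phoneme symbols are chosen arbitrarily, and `toPhonemes` is a made-up name. A real system would need a full grapheme-to-phoneme model.

```javascript
// Hypothetical toy table, ordered so that longer segments are tried first.
const SEGMENT_TO_PHONEMES = [
  ['pay', 'P EY'], ['per', 'P ER'],
  ['ew', 'UW'], ['oo', 'UW'], ['pa', 'P EY'], ['pr', 'P ER'],
  ['n', 'N'], ['s', 'S']
];

// Convert a spelling into a phoneme string, or null if some part of the
// word is not covered by the toy table.
function toPhonemes(word) {
  const lower = word.toLowerCase();
  const out = [];
  let i = 0;
  while (i < lower.length) {
    // First table entry that matches at position i (longest entries come first).
    const hit = SEGMENT_TO_PHONEMES.find(function (pair) {
      return lower.startsWith(pair[0], i);
    });
    if (!hit) return null;
    out.push(hit[1]);
    i += hit[0].length;
  }
  return out.join(' ');
}
```

With this table, `toPhonemes('newspaper')` and `toPhonemes('noospaypr')` produce the same string, which is what lets a correction system treat them as the same word despite the different spellings.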

hashable · Answer 6 · 2010-02-25T09:45:12+0000

You can do this using the steps described below. First I describe the algorithm, and then I give some (untested and possibly broken) Java code.

Note: I will use the Apache Commons Codec library.

Algorithm:

1. Use a regular expression to represent the input pattern.
2. From a lexicon of "valid known words", select the subset that matches your regular expression. Let me call this subset MS.
3. Use the Double Metaphone algorithm to encode the words in MS.
4. Apply some phonetic filtering to reduce MS to what you need.
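The steps above can be sketched as a small two-stage filter. The sketch is in JavaScript only to match the earlier example on this page, and it makes several assumptions: `generateWords` and `toyEncode` are made-up names, the spelling regex is a simplified stand-in, and `toyEncode` is a lookup table that stands in for a real Double Metaphone implementation, holding only the codes this answer reports for its five example words.

```javascript
// lexicon:       array of candidate words
// spellingRegex: RegExp describing the input pattern (steps 1 and 2)
// encode:        function mapping a word to its phonetic code (step 3)
// codePrefix:    sound the phonetic code must start with, e.g. "K" (step 4)
function generateWords(lexicon, spellingRegex, encode, codePrefix) {
  return lexicon
    .filter(function (word) {
      return spellingRegex.test(word);             // steps 1-2: subset MS
    })
    .filter(function (word) {
      return encode(word).startsWith(codePrefix);  // steps 3-4: phonetic filter
    });
}

// Stand-in encoder: NOT Double Metaphone, just a table of reported codes.
const TOY_CODES = { Cute: 'KT', Cat: 'KT', Cut: 'KT', Caught: 'KFT', City: 'ST' };
function toyEncode(word) {
  return TOY_CODES[word] || '';
}

// Example call: spelling starts with C and contains a T; sound must start with K.
const kWords = generateWords(
  ['Cute', 'Cat', 'Cut', 'Caught', 'City', 'Dog'],
  /^c[a-z]*t[a-z]*$/i,
  toyEncode,
  'K'
);
```

Here the regex alone would let City through, and the phonetic filter is what removes it.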

To illustrate how steps 3 and 4 work, I will first show you the output of the Double Metaphone algorithm on the five words that you suggested as examples: Cute, Cat, Cut, Caught, City

Code A (illustrating Double Metaphone):

    // Requires: import org.apache.commons.codec.language.DoubleMetaphone;
    private static void doubleMetaphoneTest() {
        DoubleMetaphone dm = new DoubleMetaphone();
        // Print each word next to its primary Double Metaphone encoding
        System.out.println("Cute\t" + dm.encode("Cute"));
        System.out.println("Cat\t" + dm.encode("Cat"));
        System.out.println("Cut\t" + dm.encode("Cut"));
        System.out.println("Caught\t" + dm.encode("Caught"));
        System.out.println("City\t" + dm.encode("City"));
    }

Code A output

    Cute    KT
    Cat     KT
    Cut     KT
    Caught  KFT
    City    ST

In your question, you stated that City is not a correct result because it starts with an "ESS" sound. Double Metaphone will help you catch this kind of problem (although I am sure there will be cases where it does not help). You can now apply step 4 of the algorithm using this principle.

In the following code for step 4 (apply some phonetic filtering), I assume that you already know that you only need the sound “K” and not the sound “S”.

Code B (prototype solution for the whole question)

Note: This code is intended to illustrate the use of the DoubleMetaphone algorithm for your purpose. I have not run the code. The regular expression may be broken or very crude, or my use of Pattern and Matcher may be wrong (it is 2 AM here). If it is wrong, please improve or correct it.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.commons.codec.language.DoubleMetaphone;

    public class GenerateWords {

        /**
         * Returns the words in the lexicon that conform to the input pattern.
         * @param inputPattern a regular expression
         * @param lexicon a list of valid words
         */
        public static List<String> fetchMatchingWordsFromLexicon(String inputPattern, List<String> lexicon) {
            /* E.g. for the case [C] * [T] * [N]
             * the regex is:
             * [Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]*
             * (the last group is optional so that words ending in N, such as "cotton", match)
             */
            Pattern p = Pattern.compile(inputPattern);
            List<String> result = new ArrayList<String>();
            for (String aWord : lexicon) {
                Matcher m = p.matcher(aWord);
                if (m.matches()) {
                    result.add(aWord);
                }
            }
            return result;
        }

        /**
         * Returns the subset of the input list that "phonetically" begins with the character specified.
         * E.g. the word 'cat' begins with 'K' and the word 'city' begins with 'S'.
         * @param prefix the first character expected in the Double Metaphone encoding
         * @param possibleWords the candidate words
         * @return the words whose encoding starts with the prefix
         */
        public static List<String> filterWordsBeginningWithMetaphonePrefix(char prefix, List<String> possibleWords) {
            List<String> result = new ArrayList<String>();
            DoubleMetaphone dm = new DoubleMetaphone();
            for (String aWord : possibleWords) {
                String phoneticRepresentation = dm.encode(aWord); // always returned in all caps
                // check whether the encoding begins with the prefix char of interest
                if (phoneticRepresentation != null && !phoneticRepresentation.isEmpty()
                        && phoneticRepresentation.charAt(0) == Character.toUpperCase(prefix)) {
                    result.add(aWord);
                }
            }
            return result;
        }

        /** Placeholder loader: assumes a plain-text file "words.txt" with one word per line. */
        private static List<String> readLexiconFromFileIntoList() throws IOException {
            return Files.readAllLines(Paths.get("words.txt"), StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws IOException {
            List<String> lexicon = readLexiconFromFileIntoList();
            String regex = "[Cc]+[aeiouyAEIOUY]+[Tt]+[aeiouyAEIOUY]+[Nn]+[aeiouyAEIOUY]*";
            List<String> possibleWords = fetchMatchingWordsFromLexicon(regex, lexicon);
            // your result: keep only the words whose encoding starts with the "K" sound
            List<String> result = filterWordsBeginningWithMetaphonePrefix('K', possibleWords);
            System.out.println(result);
        }
    }
