How to extract variables from speech recognition

Question


I use System.Speech to recognize some phrases or words. One of them is "Set timer". I would like to extend this to "Set timer for X seconds" and have the code set the timer to X seconds. Is this possible? I have little experience with this so far; all I could find is that I need to do something with the Grammar class.

Currently, I have created my recognition engine as follows:

    SpeechRecognitionEngine = new SpeechRecognitionEngine();
    SpeechRecognitionEngine.SetInputToDefaultAudioDevice();

    var choices = new Choices();
    choices.Add("Set timer");

    var gb = new GrammarBuilder();
    gb.Append(choices);
    var g = new Grammar(gb);

    SpeechRecognitionEngine.LoadGrammarAsync(g);
    SpeechRecognitionEngine.RecognizeAsync(RecognizeMode.Multiple);
    SpeechRecognitionEngine.SpeechRecognized += OnSpeechRecognized;

Is there any way to do this?


c# speech-recognition system.speech.recognition

Randomstranger Mar 25 '18 at 18:40


1 answer

Evk · Accepted Answer · 2018-03-28T10:12:15+0000

First, there is no built-in concept of a number. Speech is just a sequence of words, so to recognize numbers you need to recognize the words that mean numbers, such as "one" and "fifteen". Some numbers take several words, such as "one hundred" or "fifty-one", and you need to recognize those too.

You can start by simply recognizing numbers from 1 to 9:

    // requires: using System; using System.Globalization; using System.Speech.Recognition;
    var engine = new SpeechRecognitionEngine(CultureInfo.GetCultureInfo("en-US"));
    engine.SetInputToDefaultAudioDevice();

    var num1To9 = new Choices(
        new SemanticResultValue("one", 1),
        new SemanticResultValue("two", 2),
        new SemanticResultValue("three", 3),
        new SemanticResultValue("four", 4),
        new SemanticResultValue("five", 5),
        new SemanticResultValue("six", 6),
        new SemanticResultValue("seven", 7),
        new SemanticResultValue("eight", 8),
        new SemanticResultValue("nine", 9));

    var gb = new GrammarBuilder();
    gb.Culture = CultureInfo.GetCultureInfo("en-US");
    gb.Append("set timer for");
    gb.Append(num1To9);
    gb.Append("seconds");

    var g = new Grammar(gb);
    engine.LoadGrammar(g); // better not use LoadGrammarAsync
    engine.SpeechRecognized += OnSpeechRecognized;
    engine.RecognizeAsync(RecognizeMode.Multiple);

    Console.WriteLine("Speak");
    Console.ReadKey();

So, our grammar can be read as:

the phrase "set timer for",
followed by "one" OR "two" OR "three" ...,
and then "seconds"

We use SemanticResultValue to assign a tag to a specific phrase. In this case, this tag is a number (1,2,3 ...) corresponding to a specific word ("one", "two", "three"). By doing this, you can extract this value from the recognition result:

    private static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        var numSeconds = (int)e.Result.Semantics.Value;
        Console.WriteLine($"Starting timer for {numSeconds} seconds...");
    }

This is a working example that recognizes phrases like "set timer for five seconds" and lets you extract the semantic value (5) from them.
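To try a grammar like this without a microphone, System.Speech can emulate recognition from text with EmulateRecognize. The following is a minimal sketch, not part of the original answer: the class name EmulateSketch and the shortened digit list are assumptions made for illustration, and it needs a Windows machine with an en-US recognizer installed.

```csharp
using System;
using System.Globalization;
using System.Speech.Recognition;

static class EmulateSketch
{
    static void Main()
    {
        var culture = CultureInfo.GetCultureInfo("en-US");
        using (var engine = new SpeechRecognitionEngine(culture))
        {
            // Shortened digit list, for illustration only.
            var digits = new Choices(
                new SemanticResultValue("one", 1),
                new SemanticResultValue("five", 5),
                new SemanticResultValue("nine", 9));

            var gb = new GrammarBuilder { Culture = culture };
            gb.Append("set timer for");
            gb.Append(digits);
            gb.Append("seconds");
            engine.LoadGrammar(new Grammar(gb));

            // Text stands in for audio here. The result is null when the
            // phrase does not match any loaded grammar.
            RecognitionResult result =
                engine.EmulateRecognize("set timer for five seconds");

            if (result == null)
                Console.WriteLine("Phrase not recognized");
            else
                Console.WriteLine($"Semantic value: {result.Semantics.Value}");
        }
    }
}
```

Feeding it a phrase outside the grammar, such as "set timer for ten seconds", is a quick way to check that unsupported numbers are rejected rather than misread.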

Now you can combine different number words together, for example:

    var num10To19 = new Choices(
        new SemanticResultValue("ten", 10),
        new SemanticResultValue("eleven", 11),
        new SemanticResultValue("twelve", 12),
        new SemanticResultValue("thirteen", 13),
        new SemanticResultValue("fourteen", 14),
        new SemanticResultValue("fifteen", 15),
        new SemanticResultValue("sixteen", 16),
        new SemanticResultValue("seventeen", 17),
        new SemanticResultValue("eighteen", 18),
        new SemanticResultValue("nineteen", 19)
    );

    var numTensFrom20To90 = new Choices(
        new SemanticResultValue("twenty", 20),
        new SemanticResultValue("thirty", 30),
        new SemanticResultValue("forty", 40),
        new SemanticResultValue("fifty", 50),
        new SemanticResultValue("sixty", 60),
        new SemanticResultValue("seventy", 70),
        new SemanticResultValue("eighty", 80),
        new SemanticResultValue("ninety", 90)
    );

    var num20to99 = new GrammarBuilder();
    // first word is "twenty", "thirty" etc
    num20to99.Append(numTensFrom20To90);
    // followed by ONE OR ZERO "digit" words ("one", "two", "three" etc)
    num20to99.Append(num1To9, 0, 1);

But it becomes difficult to assign semantic values to such combinations correctly, because the GrammarBuilder API is not powerful enough for this.

When what you want to do cannot be (easily) done with pure GrammarBuilder and related classes, you need to use the more powerful XML grammar files, whose syntax is defined in the W3C Speech Recognition Grammar Specification (SRGS).

Describing these grammar files is beyond the scope of this question, but fortunately for your task there is already a grammar file provided with the Microsoft Speech SDK, which you have probably already downloaded and installed. So, copy the file from "C:\Program Files\Microsoft SDKs\Speech\v11.0\Samples\Sample Grammars\en-US.grxml" (or wherever you installed the SDK) and remove the parts that are not relevant here, for example the first <tag> element with the large CDATA block inside. The code below loads the copy as "en-US-sample.grxml".

The rule of interest in this file is called "Cardinal"; it recognizes numbers from 0 to 1 million. Our code then becomes:

    // requires: using System.Speech.Recognition.SrgsGrammar;
    var sampleDoc = new SrgsDocument(@"en-US-sample.grxml");
    sampleDoc.Culture = CultureInfo.GetCultureInfo("en-US");

    // define new rule, named Timer
    SrgsRule rootRule = new SrgsRule("Timer");
    // match "set timer for" phrase
    rootRule.Add(new SrgsItem("set timer for"));
    // followed by whatever "Cardinal" rule defines (reference to another rule)
    rootRule.Add(new SrgsRuleRef(sampleDoc.Rules["Cardinal"]));
    // followed by "seconds"
    rootRule.Add(new SrgsItem("seconds"));
    // add to rules
    sampleDoc.Rules.Add(rootRule);
    // make it a root rule, so that it will be used for recognition
    sampleDoc.Root = rootRule;

    var g = new Grammar(sampleDoc);
    engine.LoadGrammar(g); // better not use LoadGrammarAsync
    engine.SpeechRecognized += OnSpeechRecognized;
    engine.RecognizeAsync(RecognizeMode.Multiple);

And the handler becomes:

    private static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        var numSeconds = Convert.ToInt32(e.Result.Semantics.Value);
        Console.WriteLine($"Starting timer for {numSeconds} seconds...");
    }

Now you can recognize numbers up to 1 million.

Of course, there is no need to define a rule in the code, as we did above - you can completely define all your rules in xml, and then just load it as SrgsDocument and create a Grammar from it.

If you want to recognize several commands, here is an example:

    var sampleDoc = new SrgsDocument(@"en-US-sample.grxml");
    sampleDoc.Culture = CultureInfo.GetCultureInfo("en-US");

    // this rule is the same as above
    var setTimerRule = new SrgsRule("SetTimer");
    setTimerRule.Add(new SrgsItem("set timer for"));
    setTimerRule.Add(new SrgsRuleRef(sampleDoc.Rules["Cardinal"]));
    setTimerRule.Add(new SrgsItem("seconds"));
    sampleDoc.Rules.Add(setTimerRule);

    // new rule, clear timer
    var clearTimerRule = new SrgsRule("ClearTimer");
    // just match this phrase
    clearTimerRule.Add(new SrgsItem("clear timer"));
    sampleDoc.Rules.Add(clearTimerRule);

    // new root rule, matching either set timer OR clear timer
    // (named "Timers" to match the XML version of this grammar)
    var rootRule = new SrgsRule("Timers");
    rootRule.Add(new SrgsOneOf( // << OneOf is basically the same as Choices
        // reference to SetTimer
        new SrgsItem(new SrgsRuleRef(setTimerRule),
            // assign command name. Both "command" and "settimer" are arbitrary names I chose
            new SrgsSemanticInterpretationTag("out = rules.latest();out.command = 'settimer';")),
        new SrgsItem(new SrgsRuleRef(clearTimerRule),
            // assign command name. If this rule "wins" - command will be cleartimer
            new SrgsSemanticInterpretationTag("out.command = 'cleartimer';"))
    ));
    sampleDoc.Rules.Add(rootRule);
    sampleDoc.Root = rootRule;

    var g = new Grammar(sampleDoc);

And the handler becomes:

    private static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        var sem = e.Result.Semantics;
        // here "command" is arbitrary key we assigned in our rule
        var commandName = (string) sem["command"].Value;
        switch (commandName)
        {
            // also arbitrary values we assigned, not related to rule names or something else
            case "settimer":
                var numSeconds = Convert.ToInt32(sem.Value);
                Console.WriteLine($"Starting timer for {numSeconds} seconds...");
                break;
            case "cleartimer":
                Console.WriteLine("timer cleared");
                break;
        }
    }
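A more defensive variant of this handler might look like the sketch below. It is an assumption layered on the answer, not part of it: the class name GuardedHandlerSketch and the 0.6 confidence cut-off are chosen arbitrarily for illustration. It skips low-confidence results and checks that the "command" key exists before reading it.

```csharp
using System;
using System.Speech.Recognition;

static class GuardedHandlerSketch
{
    // Arbitrary cut-off chosen for illustration; tune it for your setup.
    private const float MinConfidence = 0.6f;

    public static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        // Confidence is the recognizer's own score for this phrase.
        if (e.Result.Confidence < MinConfidence)
            return;

        SemanticValue sem = e.Result.Semantics;

        // The "command" key only exists if the matched rule assigned it.
        if (!sem.ContainsKey("command"))
            return;

        var commandName = sem["command"].Value as string;
        switch (commandName)
        {
            case "settimer":
                var numSeconds = Convert.ToInt32(sem.Value);
                Console.WriteLine($"Starting timer for {numSeconds} seconds...");
                break;
            case "cleartimer":
                Console.WriteLine("timer cleared");
                break;
        }
    }
}
```

It is attached the same way as the original handler: engine.SpeechRecognized += GuardedHandlerSketch.OnSpeechRecognized;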

For completeness, here's how you can do the same with pure xml. Open this file "en-US-sample.grxml" with the xml editor and add the rules defined above in the code. They will look like this:

    <rule id="SetTimer" scope="private">
      <item>set timer for</item>
      <item>
        <ruleref uri="#Cardinal" />
      </item>
      <item>seconds</item>
    </rule>

    <rule id="ClearTimer" scope="private">
      <item>clear timer</item>
    </rule>

    <rule id="Timers" scope="public">
      <one-of>
        <item>
          <ruleref uri="#SetTimer" />
          <tag>out = rules.latest(); out.command = 'settimer'</tag>
        </item>
        <item>
          <ruleref uri="#ClearTimer" />
          <tag>out.command = 'cleartimer'</tag>
        </item>
      </one-of>
    </rule>

Now set the root rule in the root grammar tag:

 <grammar xml:lang="en-US" version="1.0" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0" root="Timers">

And save.

Now we don't need to define anything at all in the code; all we need to do is load our grammar file:

    var sampleDoc = new SrgsDocument(@"en-US-sample.grxml");
    var g = new Grammar(sampleDoc);
    engine.LoadGrammar(g);

That's all. Since the Timers rule is the root rule in the grammar file, it will be used for recognition and will behave exactly like the version defined in the code.
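The handlers above only print a message. One way to actually start a timer with the extracted number of seconds is sketched below. This is an illustrative assumption rather than part of the answer: TimerSketch and StartTimer are hypothetical names, and System.Threading.Timer is just one of several timer classes that would work.

```csharp
using System;
using System.Threading;

static class TimerSketch
{
    // Hypothetical helper: starts a one-shot timer that invokes onElapsed
    // once after numSeconds. Keep the returned Timer referenced until it
    // fires, otherwise it may be garbage-collected first.
    public static Timer StartTimer(int numSeconds, Action onElapsed)
    {
        if (numSeconds <= 0)
            throw new ArgumentOutOfRangeException(nameof(numSeconds));

        // The first TimeSpan is the delay before the callback runs;
        // Timeout.InfiniteTimeSpan as the period disables repetition.
        return new Timer(_ => onElapsed(), null,
                         TimeSpan.FromSeconds(numSeconds),
                         Timeout.InfiniteTimeSpan);
    }

    static void Main()
    {
        using (var done = new ManualResetEventSlim())
        using (StartTimer(2, () => { Console.WriteLine("Time is up"); done.Set(); }))
        {
            // Block until the timer callback signals completion.
            done.Wait();
        }
    }
}
```

In the "settimer" branch of the handler, a call such as StartTimer(numSeconds, ...) would replace the Console.WriteLine, and the "cleartimer" branch could dispose the stored Timer.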
