This is very similar to this question and has a very similar answer. You need to extract the audio track, convert it to WAV format, and feed it to the inproc recognizer.
However, it has the same limitations I mentioned earlier: it requires training, assumes a single voice, and assumes the microphone is close to the speaker. If your audio meets those conditions, you are likely to get pretty good results. If it does not (e.g., you are trying to transcribe a television show or, even worse, audio recorded by a camcorder), the results are likely to be unsatisfactory.
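The extraction and conversion step could be sketched roughly as follows. This is only an illustration, not part of the original answer: it assumes the `ffmpeg` command-line tool is installed and on your PATH, and the 16 kHz mono 16-bit PCM output is my own assumption about a reasonable format for speech recognition. The function names `build_ffmpeg_command` and `extract_audio` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that drops the video and writes a PCM WAV file.

    The defaults (mono, 16-bit, 16 kHz) are assumptions, not requirements
    stated in the answer; adjust them to whatever your recognizer expects.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(video_path),     # input video file
        "-vn",                     # discard the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM audio
        "-ac", "1",                # downmix to a single (mono) channel
        "-ar", str(sample_rate),   # output sample rate in Hz
        str(wav_path),             # output WAV file
    ]


def extract_audio(video_path, wav_path, sample_rate=16000):
    """Run ffmpeg; raises subprocess.CalledProcessError if conversion fails."""
    cmd = build_ffmpeg_command(video_path, wav_path, sample_rate)
    subprocess.run(cmd, check=True)
    return Path(wav_path)
```

Calling `extract_audio("clip.mp4", "clip.wav")` would then produce a WAV file that you can hand to the recognizer. The recognition step itself is not shown here.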