I understand the main steps in creating an automated speech recognition engine. However, I need a clearer picture of how segmentation is performed and what the overall framework and patterns are. I will write down what I know, and I hope whoever answers will correct me where I am mistaken and guide me further.
The main stages of speech recognition that I know of are listed below; after the list I have added rough code sketches of how I imagine each step working:
(I assume the input is a WAV/OGG or similar audio file.)
- Pre-emphasize the speech signal, i.e. apply a filter that boosts the high-frequency content. Perhaps something like: y[n] = x[n] - 0.95 * x[n-1]
- Find the times at which the utterance begins and ends, and trim the clip accordingly. (This step may be interchangeable with step 1.)
- Divide the clip into small overlapping time frames, each lasting about 30 ms. Each frame would contain about 256 samples, and consecutive frames would start about 100 samples apart (i.e. a shift of roughly 30 * 100/256 ms)?
- Apply a Hamming window to each frame (each 256-sample segment)? The result is an array of windowed frames.
- Take the fast Fourier transform of each windowed frame's signal to obtain its frequency spectrum.
- Mel filter bank processing: (I have not gotten into this part yet)
- Discrete cosine transform: (I have not gone into detail yet), but I know this will give me a set of MFCCs, also called an acoustic vector, for each input frame.
- Delta energy and delta spectrum: I know this is used to compute the delta and double-delta coefficients of the MFCCs, but not much more than that.
- After that, I think I need to use an HMM or ANN to classify the mel-frequency cepstral coefficients (plus their deltas and double deltas) into the corresponding phonemes, and then perform an analysis to map phonemes to words.
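
To make my understanding concrete, here are rough Python/NumPy sketches of each step as I imagine it. All function names, parameter values (sample rate, frame size, thresholds) and the library choices are my own assumptions, not taken from any particular ASR toolkit. First, pre-emphasis as the filter y[n] = x[n] - 0.95 * x[n-1]:

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is passed through as-is.
    return np.append(x[0], x[1:] - alpha * x[:-1])
```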
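For step 2, I imagine a toy energy-based endpoint detector: keep everything between the first and last chunk whose RMS energy exceeds a threshold (the threshold value is a guess):

```python
def trim_silence(x, frame_len=256, threshold=0.01):
    # Split into non-overlapping chunks and measure RMS energy per chunk.
    n = len(x) // frame_len
    chunks = x[:n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((chunks ** 2).mean(axis=1))
    active = np.where(rms > threshold)[0]
    if len(active) == 0:
        return x  # nothing above threshold; leave the clip alone
    return x[active[0] * frame_len : (active[-1] + 1) * frame_len]
```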
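For steps 3 and 4 (framing and windowing), this is how I picture 256-sample frames with a 100-sample hop, with a Hamming window applied to each whole frame:

```python
def frame_and_window(x, frame_len=256, hop=100):
    # Consecutive frames start `hop` samples apart, so with the defaults
    # they overlap by frame_len - hop = 156 samples.
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # window each row (frame)
```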
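For step 5, the FFT is taken frame by frame; I assume the later stages need the one-sided power spectrum of each frame:

```python
def power_spectrum(frames, nfft=256):
    spec = np.fft.rfft(frames, n=nfft)   # shape: (n_frames, nfft // 2 + 1)
    return (np.abs(spec) ** 2) / nfft    # magnitude squared, per frame
```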
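For step 6 (the part I have not studied yet), my understanding is that the mel filter bank is a set of triangular filters spaced evenly on the mel scale; multiplying the power spectrum by the filters and taking logs gives one log-energy per filter per frame. A sketch, with the filter count and sample rate as assumptions:

```python
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, nfft=256, sr=8000):
    # Filter edges are evenly spaced in mel, then mapped back to FFT bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

# Log filter-bank energies, one row per frame:
# log_fbank = np.log(power_spectrum(frames) @ mel_filterbank().T + 1e-10)
```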
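For step 7, the DCT of the log filter-bank energies, keeping only the first few coefficients, would then give the MFCC vector for each frame (I am assuming SciPy's `dct` here):

```python
from scipy.fftpack import dct

def mfccs(log_fbank, n_ceps=13):
    # DCT-II decorrelates the log energies; the first n_ceps
    # coefficients form the MFCC (acoustic) vector of each frame.
    return dct(log_fbank, type=2, axis=1, norm='ortho')[:, :n_ceps]
```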
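For step 8, this is the regression formula for deltas as I understand it; applying it twice gives the double deltas:

```python
def deltas(feat, N=2):
    # d[t] = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2), with edge padding.
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    T = len(feat)
    return sum(n * (padded[N + n : T + N + n] - padded[N - n : T + N - n])
               for n in range(1, N + 1)) / denom

# Full feature vector per frame, as I imagine it:
# features = np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])
```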
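Finally, for the last step, here is how I imagine a minimal isolated-word recogniser with one HMM per word, using the third-party `hmmlearn` package (that choice, and the state count, are my assumptions):

```python
from hmmlearn import hmm

def train_word_models(training_data, n_states=5):
    # training_data: dict mapping word -> list of (T, D) feature arrays
    models = {}
    for word, examples in training_data.items():
        X = np.concatenate(examples)
        lengths = [len(e) for e in examples]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(X, lengths)
        models[word] = m
    return models

def recognise(models, features):
    # Pick the word whose model assigns the highest log-likelihood.
    return max(models, key=lambda w: models[w].score(features))
```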
Although most of this is clear to me, I am confused about whether step 3 is correct. If it is, do the steps that follow step 3 apply to each frame separately? Also, after step 6, I think each frame ends up with its own set of MFCCs, right?
Thank you in advance!