I understand the main steps in creating an automated speech recognition engine. However, I need a clearer picture of how segmentation is performed and what the overall framework and patterns are. I will write down what I know, and I hope whoever answers will correct me where I am mistaken and guide me further.
The main stages of speech recognition that I know of are listed below; after the list I have added rough code sketches of how I imagine each step working:
(I assume the input is a WAV/OGG or similar audio file.)
- Pre-emphasize the speech signal, i.e. apply a filter that boosts the high-frequency content. Perhaps something like: y[n] = x[n] - 0.95 * x[n-1]
- Find the times at which the utterance begins and ends, and trim the clip accordingly. (This step may be interchangeable with step 1.)
- Divide the clip into small overlapping time frames, each lasting about 30 ms. Each frame would contain about 256 samples, and consecutive frames would start about 100 samples apart (i.e. a shift of roughly 30 * 100/256 ms)?
- Apply a Hamming window to each frame (each 256-sample segment)? The result is an array of windowed frames.
- Take the fast Fourier transform of each windowed frame's signal to obtain its frequency spectrum.
- Mel filter bank processing: (I have not gotten into this part yet)
- Discrete cosine transform: (I have not gone into detail yet), but I know this will give me a set of MFCCs, also called an acoustic vector, for each input frame.
- Delta energy and delta spectrum: I know this is used to compute the delta and double-delta coefficients of the MFCCs, but not much more than that.
- After that, I think I need to use an HMM or ANN to classify the mel-frequency cepstral coefficients (plus their deltas and double deltas) into the corresponding phonemes, and then perform an analysis to map phonemes to words.
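
To make my understanding concrete, here are rough Python/NumPy sketches of each step as I imagine it. All function names, parameter values (sample rate, frame size, thresholds) and the library choices are my own assumptions, not taken from any particular ASR toolkit. First, pre-emphasis as the filter y[n] = x[n] - 0.95 * x[n-1]:

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is passed through as-is.
    return np.append(x[0], x[1:] - alpha * x[:-1])
```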
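For step 2, I imagine a toy energy-based endpoint detector: keep everything between the first and last chunk whose RMS energy exceeds a threshold (the threshold value is a guess):

```python
def trim_silence(x, frame_len=256, threshold=0.01):
    # Split into non-overlapping chunks and measure RMS energy per chunk.
    n = len(x) // frame_len
    chunks = x[:n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((chunks ** 2).mean(axis=1))
    active = np.where(rms > threshold)[0]
    if len(active) == 0:
        return x  # nothing above threshold; leave the clip alone
    return x[active[0] * frame_len : (active[-1] + 1) * frame_len]
```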
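For steps 3 and 4 (framing and windowing), this is how I picture 256-sample frames with a 100-sample hop, with a Hamming window applied to each whole frame:

```python
def frame_and_window(x, frame_len=256, hop=100):
    # Consecutive frames start `hop` samples apart, so with the defaults
    # they overlap by frame_len - hop = 156 samples.
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # window each row (frame)
```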
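For step 5, the FFT is taken frame by frame; I assume the later stages need the one-sided power spectrum of each frame:

```python
def power_spectrum(frames, nfft=256):
    spec = np.fft.rfft(frames, n=nfft)   # shape: (n_frames, nfft // 2 + 1)
    return (np.abs(spec) ** 2) / nfft    # magnitude squared, per frame
```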
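For step 6 (the part I have not studied yet), my understanding is that the mel filter bank is a set of triangular filters spaced evenly on the mel scale; multiplying the power spectrum by the filters and taking logs gives one log-energy per filter per frame. A sketch, with the filter count and sample rate as assumptions:

```python
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, nfft=256, sr=8000):
    # Filter edges are evenly spaced in mel, then mapped back to FFT bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

# Log filter-bank energies, one row per frame:
# log_fbank = np.log(power_spectrum(frames) @ mel_filterbank().T + 1e-10)
```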
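For step 7, the DCT of the log filter-bank energies, keeping only the first few coefficients, would then give the MFCC vector for each frame (I am assuming SciPy's `dct` here):

```python
from scipy.fftpack import dct

def mfccs(log_fbank, n_ceps=13):
    # DCT-II decorrelates the log energies; the first n_ceps
    # coefficients form the MFCC (acoustic) vector of each frame.
    return dct(log_fbank, type=2, axis=1, norm='ortho')[:, :n_ceps]
```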
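For step 8, this is the regression formula for deltas as I understand it; applying it twice gives the double deltas:

```python
def deltas(feat, N=2):
    # d[t] = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2), with edge padding.
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    T = len(feat)
    return sum(n * (padded[N + n : T + N + n] - padded[N - n : T + N - n])
               for n in range(1, N + 1)) / denom

# Full feature vector per frame, as I imagine it:
# features = np.hstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])
```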
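Finally, for the last step, here is how I imagine a minimal isolated-word recogniser with one HMM per word, using the third-party `hmmlearn` package (that choice, and the state count, are my assumptions):

```python
from hmmlearn import hmm

def train_word_models(training_data, n_states=5):
    # training_data: dict mapping word -> list of (T, D) feature arrays
    models = {}
    for word, examples in training_data.items():
        X = np.concatenate(examples)
        lengths = [len(e) for e in examples]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(X, lengths)
        models[word] = m
    return models

def recognise(models, features):
    # Pick the word whose model assigns the highest log-likelihood.
    return max(models, key=lambda w: models[w].score(features))
```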
Although most of this is clear to me, I am confused about whether step 3 is correct. If it is, do the steps that follow step 3 apply to each frame separately? Also, after step 6, I think each frame ends up with its own set of MFCCs, right?
Thank you in advance!