Creating a Speech Dataset for LSTM Binary Classification

Question


I am trying to do binary classification with an LSTM using Theano. I looked at the sample code, but I want to build my own.

I have a small set of "Hello" and "Goodbye" recordings. I preprocess them by extracting MFCC features and storing those features in text files. I have 20 speech files (10 per word) and I generate one text file per recording, so there are 20 text files containing MFCC features. Each file is a 13x56 matrix.

Now my problem is: how do I use these text files to train an LSTM?

I am relatively new to this. I have also read some papers, but did not come away with a good understanding of the concept.

Any simpler ways to use an LSTM are also welcome.

+5

python-2.7 theano speech-recognition lstm mfcc

Nirbhay tandon Jan 7 '16 at 17:47


1 answer

Nikolay Shmyrev · Accepted Answer · 2016-01-07T23:28:39+0000

There are many existing implementations, for example a Tensorflow implementation and a Kaldi-focused implementation with all the scripts; it is better to check them first.

Theano is too low-level; you can try Keras instead, as described in the tutorial. You can run the tutorial as is to see how things work.

Then you need to prepare the data set. You need to turn each recording into a sequence of frames, and for each frame in a sequence you need to assign an output label.
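As a rough sketch of that preparation step, the snippet below parses one MFCC text file into a list of frames. It assumes the file is whitespace-separated with one matrix row per line, and that the 13 rows are coefficients and the 56 columns are frames; the question does not say which axis is time, so swap the transpose if your files are laid out the other way. The helper names parse_mfcc_text and to_frames are made up for this illustration.

```python
def parse_mfcc_text(text):
    """Parse a whitespace-separated matrix, one row per line, into lists of floats."""
    return [[float(value) for value in line.split()]
            for line in text.splitlines() if line.strip()]


def to_frames(matrix):
    """Transpose a coefficients-by-frames matrix into a list of frames.

    Assumption: rows are MFCC coefficients and columns are time steps.
    """
    return [list(column) for column in zip(*matrix)]


# Tiny stand-in for a real 13x56 file: 2 coefficients, 3 frames.
sample_text = "1.0 2.0 3.0\n4.0 5.0 6.0\n"
frames = to_frames(parse_mfcc_text(sample_text))
print(frames)  # [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
```

For a real file you would read its contents with open(path).read() and pass them to parse_mfcc_text; a 13x56 file would then give 56 frames of 13 floats each.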

Keras supports two types of RNN layers: layers that return sequences and layers that return single values. You can experiment with both; in the code you just use return_sequences=True or return_sequences=False.

To train using sequences, you can assign a dummy label to all frames except the last one, where you assign the label of the word that you want to recognize. You must put the inputs and the output labels in arrays. So it will be:

 X = [[word1frame1, word1frame2, ..., word1framen], [word2frame1, word2frame2, ..., word2framen]]
 Y = [[0, 0, ..., 1], [0, 0, ..., 2]]

In X, each element is a vector of 13 floats. In Y, each element is just a number - 0 for intermediate frames and a word identifier for the final frame.
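A minimal sketch of building such per-frame labels for the sequence case follows. The word identifiers (1 for "hello", 2 for "goodbye") and the helper name frame_labels are assumptions chosen for illustration; 56 is the frame count from the question.

```python
def frame_labels(num_frames, word_id, dummy=0):
    """Return one label per frame: the dummy label everywhere, the word id on the last frame."""
    if num_frames < 1:
        raise ValueError("a sequence needs at least one frame")
    return [dummy] * (num_frames - 1) + [word_id]


# Hypothetical word identifiers; 0 is reserved for the dummy label.
WORD_IDS = {"hello": 1, "goodbye": 2}

Y = [
    frame_labels(56, WORD_IDS["hello"]),
    frame_labels(56, WORD_IDS["goodbye"]),
]
print(Y[0][-3:])  # [0, 0, 1]
print(Y[1][-3:])  # [0, 0, 2]
```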

To train with a single label per word (no per-frame sequences), you also put the inputs and the output labels in arrays, but the output array is simpler. Thus, the data will be:

 X = [[word1frame1, word1frame2, ..., word1framen], [word2frame1, word2frame2, ..., word2framen]]
 Y = [[0, 0, 1], [0, 1, 0]]

Note that the output is vectorized (np_utils.to_categorical) to turn it into one-hot vectors instead of integers.
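To make the vectorized form concrete, here is a plain-Python stand-in that produces the same kind of one-hot vectors. It is only an illustration of the encoding, not Keras's own implementation, and the helper name one_hot is hypothetical. With three classes (0 for the dummy label, 1 and 2 for the two words), the integer labels 2 and 1 give the Y shown above.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1 at position `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in the range [0, num_classes)")
    vector = [0] * num_classes
    vector[label] = 1
    return vector


integer_labels = [2, 1]  # one integer label per word
Y = [one_hot(label, 3) for label in integer_labels]
print(Y)  # [[0, 0, 1], [0, 1, 0]]
```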

Then you create the network architecture. The input is 13 floats per frame and the output is a vector. In the middle, you can have one fully connected layer followed by a single LSTM layer. Do not use layers that are too large; start with small ones.

Then you feed this dataset into model.fit and it trains your model. You can evaluate the quality of the model on a held-out set after training.
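One way to get a held-out set is to set aside a few recordings per word before training. The sketch below is an assumption about how you might do that, not part of Keras: the helper name split_holdout, the two-per-class holdout size and the placeholder sample names are all made up for illustration.

```python
import random


def split_holdout(samples, labels, holdout_per_class=2, seed=0):
    """Split into (train, held-out) parts, holding out the same count from each class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        test_idx.extend(indices[:holdout_per_class])
        train_idx.extend(indices[holdout_per_class:])

    def pick(indices):
        return [samples[i] for i in indices], [labels[i] for i in indices]

    return pick(train_idx), pick(test_idx)


# Placeholder names standing in for the 20 frame sequences.
samples = ["hello_%d" % i for i in range(10)] + ["goodbye_%d" % i for i in range(10)]
labels = [1] * 10 + [2] * 10

(train_x, train_y), (test_x, test_y) = split_holdout(samples, labels)
print(len(train_x), len(test_x))  # 16 4
```

The training part is what you would pass to model.fit; the held-out part is used only for evaluation afterwards.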

You will have convergence problems, since you have only 20 examples. You need more examples, preferably thousands, for LSTM training; with so few, you can use only very small models.
