I need to train a bi-directional LSTM model for isolated-word speech recognition (spoken digits 0 through 9). I recorded speech from 100 speakers. What should I do next? (Assume I have already split the recordings into separate .wav files, one digit per file.) I plan to use MFCCs as the input features for the network.
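For reference, this is roughly how I intend to extract the features. It is a minimal sketch assuming librosa; the file name and the parameter choices (16 kHz sample rate, 13 coefficients) are just illustrative:

```python
import librosa

# Load one utterance (a single spoken digit) at 16 kHz mono.
# The file name here is a placeholder for one of my split .wav files.
signal, sr = librosa.load("digit_0_speaker_01.wav", sr=16000)

# Extract 13 MFCCs per frame; result has shape (n_mfcc, n_frames).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

# Transpose to (n_frames, n_mfcc) so each time step is one feature
# vector fed to the bi-directional LSTM.
features = mfcc.T
```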
In addition, I would like to know how the dataset (or its labeling) would need to differ if I use a library that supports CTC (Connectionist Temporal Classification).
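To make the CTC part of the question concrete, here is a hedged sketch using PyTorch's `nn.CTCLoss`. The shapes, hidden size, and the label shift are assumptions for illustration, not a full training loop; the point is that CTC only needs the unsegmented target sequence (here, one digit per file), not frame-level alignments:

```python
import torch
import torch.nn as nn

num_classes = 11           # 10 digits + 1 CTC blank symbol (index 0 by convention)
T, N, F = 80, 4, 13        # frames per utterance, batch size, MFCC dimension

# Bi-directional LSTM over the MFCC frames, projected to per-frame class scores.
lstm = nn.LSTM(input_size=F, hidden_size=64, bidirectional=True)
proj = nn.Linear(2 * 64, num_classes)

x = torch.randn(T, N, F)                   # a dummy batch of MFCC sequences
out, _ = lstm(x)
log_probs = proj(out).log_softmax(dim=-1)  # (T, N, num_classes), as CTCLoss expects

# With CTC the targets are just the label sequences, with no alignment:
# one digit per file in this task, so each target has length 1.
targets = torch.tensor([3, 7, 0, 9]) + 1   # shift by 1 so index 0 stays the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.ones(N, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```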