First, you need to annotate the events in your sound streams, i.e. specify boundaries (start/end times) and labels for them.
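As a rough sketch (the exact format is up to you; the triples and the `frame_labels` helper below are just an illustration), annotations can be kept as (start, end, label) triples and then mapped onto per-frame labels:

```python
# Hypothetical annotation format: one (start_sec, end_sec, label) triple per event.
annotations = [
    (0.50, 1.20, "dog_bark"),
    (2.75, 3.10, "door_slam"),
]

def frame_labels(annotations, n_frames, hop_sec=0.010):
    """Assign an event label to every analysis frame; 'background' elsewhere."""
    labels = ["background"] * n_frames
    for start, end, label in annotations:
        first = int(start / hop_sec)
        last = min(int(end / hop_sec) + 1, n_frames)
        for i in range(first, last):
            labels[i] = label
    return labels
```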
Then transform your sounds into a sequence of feature vectors using feature extraction. A typical choice is MFCCs or log-mel filterbank features (the latter corresponds to a mel spectrogram of the sound). By doing this, you turn each recording into a sequence of fixed-size feature vectors that can be fed into a classifier. See this for a better explanation.
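A minimal sketch of this step, assuming you use the librosa library (file name and frame/hop sizes are placeholders):

```python
import librosa
import numpy as np

# Load audio at 16 kHz; 25 ms frames with a 10 ms hop are a common choice.
y, sr = librosa.load("recording.wav", sr=16000)
n_fft, hop_length = 400, 160

# Option 1: log-mel filterbank features (shape: n_frames x n_mels).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                     hop_length=hop_length, n_mels=40)
log_mel = librosa.power_to_db(mel).T

# Option 2: MFCCs (shape: n_frames x n_mfcc).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop_length).T

print(log_mel.shape, mfcc.shape)
```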
Since typical sound events last longer than a single analysis frame, you will probably need to stack several adjacent feature vectors using a sliding window and use these stacked frames as input to your NN.
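For example, a simple NumPy sketch of such frame stacking (the context size of 5 frames on each side is an arbitrary illustration):

```python
import numpy as np

def stack_frames(features, context=5, hop=1):
    """Stack 2*context+1 adjacent frames into one flat vector per window.

    features: array of shape (n_frames, n_features)
    returns:  array of shape (n_windows, (2*context+1) * n_features)
    """
    n_frames, _ = features.shape
    windows = []
    for center in range(context, n_frames - context, hop):
        window = features[center - context:center + context + 1]
        windows.append(window.reshape(-1))
    return np.array(windows)
```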
Now you have a) input features and b) a label for each analysis window, so you can try to train a DNN, CNN, or RNN to predict the sound class for each window. This task is known as detection (keyword spotting in the speech domain). For more details, I suggest you read Sainath, T. N., & Parada, C. (2015). Convolutional neural networks for small-footprint keyword spotting. In Proceedings of INTERSPEECH (pp. 1478-1482), and follow the approach described there.
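A minimal CNN classifier sketch, assuming Keras and windows of 11 stacked frames of 40 log-mel features with 3 classes (all of these numbers are placeholders, not taken from the question):

```python
import numpy as np
from tensorflow import keras

n_classes = 3  # assumed number of event classes

# Input: one window of 11 frames x 40 log-mel features, treated as a 1-channel image.
model = keras.Sequential([
    keras.layers.Input(shape=(11, 40, 1)),
    keras.layers.Conv2D(32, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X: (n_windows, 11, 40) stacked windows, y: integer class per window.
# model.fit(X[..., np.newaxis], y, epochs=20, batch_size=32, validation_split=0.1)
```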