First, you need to annotate the events in your sound streams, i.e. specify boundaries (start/end times) and labels for them.
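As a rough sketch (the exact format is up to you; the triples and the `frame_labels` helper below are just an illustration), annotations can be kept as (start, end, label) triples and then mapped onto per-frame labels:

```python
# Hypothetical annotation format: one (start_sec, end_sec, label) triple per event.
annotations = [
    (0.50, 1.20, "dog_bark"),
    (2.75, 3.10, "door_slam"),
]

def frame_labels(annotations, n_frames, hop_sec=0.010):
    """Assign an event label to every analysis frame; 'background' elsewhere."""
    labels = ["background"] * n_frames
    for start, end, label in annotations:
        first = int(start / hop_sec)
        last = min(int(end / hop_sec) + 1, n_frames)
        for i in range(first, last):
            labels[i] = label
    return labels
```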
Then transform your sounds into a sequence of feature vectors using feature extraction. A typical choice is MFCCs or log-mel filterbank features (the latter corresponds to a mel spectrogram of the sound). By doing this, you turn each recording into a sequence of fixed-size feature vectors that can be fed into a classifier. See this for a better explanation.
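A minimal sketch of this step, assuming you use the librosa library (file name and frame/hop sizes are placeholders):

```python
import librosa
import numpy as np

# Load audio at 16 kHz; 25 ms frames with a 10 ms hop are a common choice.
y, sr = librosa.load("recording.wav", sr=16000)
n_fft, hop_length = 400, 160

# Option 1: log-mel filterbank features (shape: n_frames x n_mels).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                     hop_length=hop_length, n_mels=40)
log_mel = librosa.power_to_db(mel).T

# Option 2: MFCCs (shape: n_frames x n_mfcc).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop_length).T

print(log_mel.shape, mfcc.shape)
```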
Since typical sound events last longer than a single analysis frame, you will probably need to stack several adjacent feature vectors using a sliding window and use these stacked frames as input to your NN.
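For example, a simple NumPy sketch of such frame stacking (the context size of 5 frames on each side is an arbitrary illustration):

```python
import numpy as np

def stack_frames(features, context=5, hop=1):
    """Stack 2*context+1 adjacent frames into one flat vector per window.

    features: array of shape (n_frames, n_features)
    returns:  array of shape (n_windows, (2*context+1) * n_features)
    """
    n_frames, _ = features.shape
    windows = []
    for center in range(context, n_frames - context, hop):
        window = features[center - context:center + context + 1]
        windows.append(window.reshape(-1))
    return np.array(windows)
```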
Now you have a) input features and b) a label for each analysis window, so you can try to train a DNN, CNN, or RNN to predict the sound class for each window. This task is known as detection (keyword spotting in the speech domain). For more details, I suggest you read Sainath, T. N., & Parada, C. (2015). Convolutional neural networks for small-footprint keyword spotting. In Proceedings of INTERSPEECH (pp. 1478-1482), and follow the approach described there.
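A minimal CNN classifier sketch, assuming Keras and windows of 11 stacked frames of 40 log-mel features with 3 classes (all of these numbers are placeholders, not taken from the question):

```python
import numpy as np
from tensorflow import keras

n_classes = 3  # assumed number of event classes

# Input: one window of 11 frames x 40 log-mel features, treated as a 1-channel image.
model = keras.Sequential([
    keras.layers.Input(shape=(11, 40, 1)),
    keras.layers.Conv2D(32, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X: (n_windows, 11, 40) stacked windows, y: integer class per window.
# model.fit(X[..., np.newaxis], y, epochs=20, batch_size=32, validation_split=0.1)
```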