How does speech length affect the neural network in speaker recognition?

I am studying neural networks and trying to build a speaker recognition system using TensorFlow. I would like to know how speech length affects the neural network. For example, I have 1000 different sound recordings of the same length and 1000 different sound recordings of varying lengths. How, in theory, will a neural network handle each of these types of data? Will a neural network trained on recordings of the same length work better or worse? Why?

2 answers

It depends on the type of neural network. With a conventional (feed-forward) network, you usually specify a fixed number of input neurons, so you cannot feed it data of arbitrary length. For longer sequences, you have to either crop your data or use a sliding window, as in the sketch below.
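Here is a minimal sketch of the sliding-window idea in Python/NumPy (not from the original answer; the window and hop sizes are arbitrary assumptions):

```python
import numpy as np

def frame_signal(signal, window_size=16000, hop_size=8000):
    """Split a 1-D audio signal into overlapping fixed-length windows.

    Shorter signals are zero-padded to one full window; longer signals
    yield several windows (the sliding-window approach described above).
    """
    if len(signal) < window_size:
        signal = np.pad(signal, (0, window_size - len(signal)))
    frames = [
        signal[start:start + window_size]
        for start in range(0, len(signal) - window_size + 1, hop_size)
    ]
    return np.stack(frames)  # shape: (num_windows, window_size)
```

Each window can then be fed to a network with a fixed-size input layer, regardless of how long the original recording was.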

However, some neural networks can process input sequences of arbitrary length, for example a recurrent neural network (RNN). The latter seems like a very good candidate for your problem. There is a good article describing the implementation of a particular type of RNN called Long Short-Term Memory (LSTM), which works well for speech recognition.
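As an illustration, here is a minimal sketch in TensorFlow/Keras (which the question mentions) of an LSTM model whose time dimension is left unspecified, so each utterance can supply a different number of feature frames. The layer sizes, feature dimension, and number of speakers are illustrative assumptions, not values from the answer:

```python
import tensorflow as tf

num_features = 13     # e.g. MFCC coefficients per frame (assumption)
num_speakers = 1000   # matches the 1000 recordings in the question

model = tf.keras.Sequential([
    # None in the time axis lets sequences of any length through.
    tf.keras.Input(shape=(None, num_features)),
    tf.keras.layers.Masking(mask_value=0.0),   # ignore zero-padded frames
    tf.keras.layers.LSTM(128),                 # summarizes the whole sequence
    tf.keras.layers.Dense(num_speakers, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The Masking layer matters when batching: sequences in one batch still have to be padded to the length of the longest one, and masking keeps the padded frames from influencing the LSTM state.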


I assume that your question can be reformulated as: How can a neural network process audio of different lengths?

The trick is that a signal of arbitrary length is converted into a sequence of fixed-size feature vectors. See my answers here and here.
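A minimal sketch of what this could look like in practice, assuming librosa for feature extraction (the library and parameter values are my assumptions, not part of the answer): fixed-size MFCC vectors are computed frame by frame, so a longer recording simply yields more frames.

```python
import librosa

def extract_features(path, n_mfcc=13):
    """Return a (num_frames, n_mfcc) array: num_frames depends on the
    recording length, but every frame vector has the same fixed size."""
    signal, sample_rate = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T  # transpose so that time is the first axis
```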


Source: https://habr.com/ru/post/1262240/

