You cannot feed inputs of different sizes directly to a classifier, but you can convert your signal into a sequence of fixed-size feature vectors (or into a sequence of fixed-size fragments of the original audio). For audio, MFCCs or a plain spectrogram are the usual choice. You then need a method that works with sequences: either a recurrent neural network, or an ordinary network applied to each frame separately, whose per-frame outputs you then post-process (for example, by averaging them) to get one answer for the whole signal.
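To make the second option concrete, here is a minimal sketch in plain Python. It cuts a signal of any length into fixed-size frames, runs a per-frame classifier, and averages the per-frame outputs. The names `split_into_frames`, `classify_signal` and `toy_frame_classifier`, and the frame size and hop values, are my own assumptions for illustration. The toy classifier is only a stand-in for a real trained model, and in practice you would probably compute MFCCs for each frame rather than use raw samples.

```python
from typing import Callable, List, Sequence


def split_into_frames(signal: Sequence[float], frame_size: int, hop: int) -> List[List[float]]:
    """Cut a variable-length signal into fixed-size frames (last frame is zero-padded)."""
    if frame_size <= 0 or hop <= 0:
        raise ValueError("frame_size and hop must be positive")
    frames = []
    start = 0
    while True:
        frame = list(signal[start:start + frame_size])
        frame += [0.0] * (frame_size - len(frame))  # pad a short final frame
        frames.append(frame)
        if start + frame_size >= len(signal):
            break
        start += hop
    return frames


def toy_frame_classifier(frame: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a trained model: returns [p_quiet, p_loud]."""
    energy = sum(abs(x) for x in frame) / len(frame)
    p_loud = min(1.0, energy)
    return [1.0 - p_loud, p_loud]


def classify_signal(
    signal: Sequence[float],
    frame_classifier: Callable[[Sequence[float]], List[float]],
    frame_size: int = 4,
    hop: int = 2,
) -> int:
    """Classify a whole signal by averaging the per-frame class probabilities."""
    frames = split_into_frames(signal, frame_size, hop)
    per_frame = [frame_classifier(f) for f in frames]
    n_classes = len(per_frame[0])
    mean_probs = [
        sum(p[c] for p in per_frame) / len(per_frame) for c in range(n_classes)
    ]
    # Index of the class with the highest average probability.
    return max(range(n_classes), key=lambda c: mean_probs[c])


# Signals of different lengths go through the same fixed-size classifier.
short_quiet = [0.1] * 5
long_loud = [0.9] * 11
print(classify_signal(short_quiet, toy_frame_classifier))
print(classify_signal(long_loud, toy_frame_classifier))
```

Averaging is just the simplest way to combine per-frame outputs; majority voting or a small model on top of the frame outputs would fit the same structure. A recurrent network replaces this manual aggregation by carrying state from frame to frame.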