Finding the "most similar" pattern for a recorded audio clip

The user trains the program with approximately 100 short recorded samples, each 0.5–5 seconds long. These patterns can be notes or phrases on a musical instrument, various percussion effects, or mouth sounds.

The program then tries to identify these samples in the audio input (which will most likely be streaming audio from a microphone).

Problems

1) Defining the boundaries of a sample within the input stream. For now I assume that every sample begins with a significant spike in volume, an "attack". Each time the input volume spikes, that point becomes a candidate sample onset.
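The attack-based boundary detection described above can be sketched as a short-time energy threshold. This is a minimal illustration, not a tuned implementation: the frame size, jump ratio, and noise floor are assumed values that would need tuning against real input.

```python
# Sketch of attack-based onset detection: split the stream into short
# frames, compute each frame's RMS energy, and flag a candidate onset
# when the energy jumps well above the previous frame's level.
import math

def rms(frame):
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def detect_onsets(samples, frame_size=512, jump_ratio=3.0, floor=0.01):
    onsets = []
    prev = floor
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        energy = rms(samples[start:start + frame_size])
        if energy > floor and energy > jump_ratio * prev:
            onsets.append(start)  # sample index where a candidate begins
        prev = max(energy, floor)
    return onsets

# Toy signal: silence, then a burst of a 440 Hz tone at an 8 kHz sample rate.
sr = 8000
signal = [0.0] * 4096 + [0.8 * math.sin(2 * math.pi * 440 * t / sr)
                         for t in range(4096)]
print(detect_onsets(signal))  # [4096]: one onset where the tone starts
```

A real detector would also need hysteresis or a refractory period so one long attack does not trigger several candidates in a row.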

2) Given a viable candidate in the audio stream, how do I find the closest match in the sample set? The match will be determined primarily by the frequency/pitch of the sound. Ideally this would support samples containing more than one frequency. Being able to compare other factors, such as timbre or waveform, would be a big bonus.

Note: I do not need to know the actual pitch of the input fragment, only which of the pre-recorded samples it most resembles, and by how much.
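That "which sample, and by how much" formulation maps naturally onto nearest-neighbour search over feature vectors. As a hedged sketch, assuming each pre-recorded sample has already been reduced to a fixed-length feature vector (for example an averaged FFT magnitude spectrum), the closest sample is the one whose vector has the highest cosine similarity to the candidate's vector. The library names and vectors below are purely illustrative.

```python
# Nearest-match step: compare a candidate's feature vector against a
# library of precomputed sample vectors by cosine similarity.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def best_match(candidate, sample_library):
    # Returns (sample_name, similarity); for non-negative spectra the
    # similarity lies in [0, 1], so it doubles as a confidence score.
    return max(
        ((name, cosine_similarity(candidate, vec))
         for name, vec in sample_library.items()),
        key=lambda pair: pair[1],
    )

library = {
    "snare":  [0.1, 0.2, 0.9, 0.3],  # toy spectra, not real data
    "kick":   [0.9, 0.4, 0.1, 0.0],
    "hi-hat": [0.0, 0.1, 0.3, 0.9],
}
name, score = best_match([0.8, 0.5, 0.2, 0.1], library)
print(name)  # "kick": the closest library entry
```

Because only relative similarity matters, no absolute pitch estimate is ever computed, which matches the requirement above.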

In short, this is a bit like speech-to-text, but with a much smaller sample set, which may make alternative processing algorithms feasible.

Should I apply an FFT to both the input and the samples to find close matches?
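For intuition on what an FFT buys here: it turns a frame of audio into a magnitude spectrum whose peaks reveal the dominant frequencies, and those spectra are what the similarity comparison would operate on. The sketch below uses a naive DFT (an FFT computes the same thing faster) on an assumed 8 kHz sample rate and a synthetic 1 kHz tone.

```python
# Naive DFT magnitude spectrum of one frame; the peak bin indicates
# the dominant frequency of the frame.
import cmath
import math

def dft_magnitudes(frame):
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2)  # first half is unique for real input
    ]

sr, n = 8000, 64
freq = 1000.0  # 1 kHz tone lands exactly on bin freq * n / sr = 8
frame = [math.sin(2 * math.pi * freq * t / sr) for t in range(n)]
mags = dft_magnitudes(frame)
peak_bin = max(range(len(mags)), key=lambda k: mags[k])
print(peak_bin, peak_bin * sr / n)  # 8 1000.0
```

Samples with more than one frequency simply produce several peaks, so spectrum-based comparison handles the multi-frequency requirement directly.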

Are there other kinds of algorithms for measuring the similarity between two sounds? Are there any pre-existing libraries that might be useful (Java, ObjC, etc.)?

I found the Wikipedia entry for Computer Audition, which says:

Sound comparison can be done by comparing features with or without reference to time. In some cases overall similarity can be estimated from similar feature values between the two sounds. In other cases, when the temporal structure is important, dynamic time warping methods must be applied to "correct" for different time scales of acoustic events. Finding repetitions and similar subsequences of sonic events is important for tasks such as texture synthesis and machine improvisation.
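The "dynamic time warping" mentioned in that quote is directly relevant here, since the same phrase played slightly faster or slower should still match its recorded sample. A minimal sketch of the classic DTW recurrence, on scalar sequences for brevity (in practice each element would be a per-frame feature vector, with a vector distance in place of the absolute difference):

```python
# Minimal dynamic time warping: cost[i][j] is the best alignment cost
# of a[:i] against b[:j]; each cell extends the cheapest of the three
# neighbouring alignments.
def dtw_distance(a, b):
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[len(a)][len(b)]

slow = [1, 1, 2, 2, 3, 3]        # same contour played at half speed
fast = [1, 2, 3]
print(dtw_distance(slow, fast))  # 0.0: DTW absorbs the tempo difference
```

A plain Euclidean distance between those two sequences would be undefined (different lengths) or large after resampling; DTW scores them as identical, which is exactly the "correction for different time scales" the quote refers to.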

But it gives very little to go on for further study. I realize this question is vague at this point, so any pointers or directions for further research are very welcome.

1 answer

There are two basic approaches you can take, both of which require serious programming ability.

The first is to use a hidden Markov model. Reading up on "speech recognition" and "hidden Markov models" will help you get started.

A newer approach is to use what is called "deep learning", either via a neural network (hard) or a sum-product network (much easier). There is a paper called "Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP]" that can help you get started.


Source: https://habr.com/ru/post/1435193/
