How to find at what time a part of one sound starts and ends within another sound?

I have two audio files in which the same sentence is read (or sung) by two different people, so the files have different lengths. They are vocals only, with no instruments.

A1: Audio File 1
A2: Audio File 2
Example sentence: "Lorem ipsum dolor sit amet, ..."

[Image: structure of the example audio files]

I know the time at which every word begins and ends in A1, and I need to automatically find the time at which each word begins and ends in A2. (Any language is fine, preferably Python or C#.)

The timings are stored in XML, so I can split the A1 file by word. So, how do I find the sound of a word inside another recording in which the words have different durations and the voice is different?

+5
3 answers

So, from what I have read, it seems you want Dynamic Time Warping (DTW). I will leave the full explanation to Wikipedia, but it is commonly used to recognize speech patterns while being robust to the variation introduced by different pronunciations.

I am more knowledgeable in C, Java, and Python, so I will suggest Python libraries.

With rpy2 you can call the R dtw library from your Python code. Unfortunately, I could not find good tutorials for the rpy2 route, but there are good examples if you decide to work in R directly.
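To make this concrete, here is a minimal sketch of the same idea in pure Python using librosa instead of rpy2/R. The file names, sample rate, and example word times are placeholders, and 13 MFCCs with a hop of 512 samples are assumptions you would tune:

    import librosa
    import numpy as np

    HOP = 512  # feature hop length in samples

    # Load both recordings at a common sample rate
    y1, sr = librosa.load("a1.wav", sr=16000)
    y2, _ = librosa.load("a2.wav", sr=16000)

    # MFCCs describe the spectral shape and are reasonably robust
    # to the two files having different voices
    X1 = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=13, hop_length=HOP)
    X2 = librosa.feature.mfcc(y=y2, sr=sr, n_mfcc=13, hop_length=HOP)

    # DTW aligns the two feature sequences; wp is the warping path of
    # (A1 frame, A2 frame) index pairs, returned end-to-start
    D, wp = librosa.sequence.dtw(X=X1, Y=X2)
    wp = wp[::-1]  # chronological order

    def a1_time_to_a2_time(t):
        """Map a time in seconds in A1 to the aligned time in A2."""
        frame = librosa.time_to_frames(t, sr=sr, hop_length=HOP)
        idx = min(np.searchsorted(wp[:, 0], frame), len(wp) - 1)
        return librosa.frames_to_time(wp[idx, 1], sr=sr, hop_length=HOP)

    # Example: a word known from the XML to span 1.20 s - 1.65 s in A1
    print(a1_time_to_a2_time(1.20), a1_time_to_a2_time(1.65))

Aligning the two files once and then mapping each word's boundaries through the warping path is usually more robust than searching for each word separately.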

Please let me know if this does not help, Cheers!

+3

Without knowing how sophisticated your understanding of the problem space is, it is hard to know whether to simply point out, or to explain in detail, why this problem is non-trivial. I would suggest starting with something like https://cloud.google.com/speech/ and trying to convert the speech in each block to text, then doing a similarity comparison on the resulting text.

If you really want to process the audio yourself, you could look at waveform or spectrographic analysis: take the waveform data, run FFTs to get frequency distributions, and look for marker patterns to align your samples (a rough sketch follows below). When comparing single words from different speakers, you probably will not get far with a neural network unless you can train it on the whole utterance from both speakers and then use the network to compare individual word fragments.

It has been several years since I did this, so maybe it is easier these days, but my memory is that, although it sounds conceptually simple, it can be more complicated than you realize. Dynamic Time Warping looks like the most promising suggestion.
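As a rough sketch of the FFT idea, assuming scipy is available (the file name and frame sizes are placeholders): each recording becomes a sequence of log-magnitude frequency frames that can then be compared or aligned.

    import numpy as np
    from scipy.io import wavfile

    def spectrogram_frames(path, frame_len=1024, hop=512):
        """Return one log-magnitude frequency distribution per frame."""
        sr, samples = wavfile.read(path)
        samples = samples.astype(np.float64)
        if samples.ndim > 1:
            samples = samples.mean(axis=1)  # mix stereo down to mono
        window = np.hanning(frame_len)
        frames = []
        for start in range(0, len(samples) - frame_len, hop):
            chunk = samples[start:start + frame_len] * window
            spectrum = np.abs(np.fft.rfft(chunk))  # magnitude per frequency bin
            frames.append(np.log1p(spectrum))      # log compresses dynamic range
        return sr, np.array(frames)  # shape: (n_frames, frame_len // 2 + 1)

    sr, frames = spectrogram_frames("a1.wav")
    print(frames.shape)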

+1

My approach would be to measure the volume in dB at a constant interval (for example, every 100 milliseconds) and save these values in a list or array. I found a way to do this in Java here: Decibel values at specific points in the wav file. The same is possible in other languages. While reading the values, keep track of the maximum volume:

    max = 0
    for each measurement x:
        currentVolume = f(x)
        if currentVolume > max:
            max = currentVolume

Then divide the maximum volume by an adjustable divisor; in my example I chose 7. Say the maximum volume is 21 dB: 21 / 7 = 3 dB. Call that value X.

Pick a second adjustable factor, such as 1, and multiply it by X. Whenever the volume rises above this new value (1 * X), we consider that the beginning of a word; when it drops back below it, we consider that the end.
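Here is a sketch of that thresholding in Python: given volume samples in dB taken every 100 ms, it returns (start, end) times of words. The sample list is made up, and the two tunable constants follow the example values above.

    INTERVAL = 0.1   # seconds between volume measurements
    DIVISOR = 7      # max volume is divided by this to get X
    FACTOR = 1       # threshold = FACTOR * X

    def word_boundaries(volumes_db):
        x = max(volumes_db) / DIVISOR
        threshold = FACTOR * x
        words, start = [], None
        for i, v in enumerate(volumes_db):
            if v > threshold and start is None:
                start = i * INTERVAL                 # word begins
            elif v <= threshold and start is not None:
                words.append((start, i * INTERVAL))  # word ends
                start = None
        if start is not None:                        # word runs to end of file
            words.append((start, len(volumes_db) * INTERVAL))
        return words

    # Example: two bursts of speech surrounded by near-silence
    print(word_boundaries([0, 2, 15, 21, 18, 2, 0, 1, 12, 14, 1, 0]))
    # -> [(0.2, 0.5), (0.8, 1.0)]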

[Image: visual explanation]

0

Source: https://habr.com/ru/post/1276004/
