Algorithm to compare two spectrograms and find the offset where they match

I record a daily 2-minute broadcast from the Internet. There is always the same starting and ending jingle. Since the exact time of the broadcast can vary by plus or minus 6 minutes, I have to record about 15 minutes of radio.

I want to determine the exact times at which those jingles occur within the 15 minutes, so I can extract the part of the audio that I want.

I have already started a C# application in which I decode the MP3 data to PCM and convert the PCM data to a spectrogram, based on http://www.codeproject.com/KB/audio-video/SoundCatcher.aspx

I tried a cross-correlation algorithm on the PCM data, but it is very slow (about 6 minutes when scanning in 10 ms increments) and in some cases it fails to find the start time of the jingle.

Any ideas for an algorithm that compares two spectrograms for a match? Or for a better way to find the start time of a jingle?

Thanks,

Update (sorry for the delay)

First, thanks for all the answers; the majority of them were relevant and contained interesting ideas.

I tried to implement the Shazam algorithm proposed by fonzo, but I could not reliably detect the peaks in the spectrogram. Below are three spectrograms of the starting jingle, taken from three different recordings. I tried AForge.NET with a blob filter (but it could not identify the peaks), blurring the image and checking the difference in height, a Laplacian convolution, and slope analysis to detect a series of vertical bars (but there were too many false positives)...

Meanwhile, I tried the Hough-transform approach proposed by Dave Aaron Smith, in which I compute the RMS of each column. Yes, yes, doing that for every column is O(N * M), but M << N (note that a column is about 8k samples), so overall it is not that bad. The algorithm still takes about 3 minutes, but it never fails.
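
For the curious, here is a minimal sketch of that metric in C# (the float[][] spectrogram layout and all names are illustrative, not the actual code of my application):

    using System;

    static class SpectrogramMetrics
    {
        // One column = one FFT frame: spectrogram[column][frequencyBin].
        // Computing the RMS of every column is the O(N * M) step mentioned above.
        public static double[] ColumnRms(float[][] spectrogram)
        {
            var rms = new double[spectrogram.Length];
            for (int col = 0; col < spectrogram.Length; col++)
            {
                double sumSquares = 0;
                foreach (float bin in spectrogram[col])
                    sumSquares += (double)bin * bin;
                rms[col] = Math.Sqrt(sumSquares / spectrogram[col].Length);
            }
            return rms;
        }
    }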

I could go with that solution, but if possible I would prefer the Shazam one, since it is O(N) and probably much faster (and cooler). So if any of you has an idea for an algorithm that consistently detects the same points in these spectrograms (they do not have to be peaks), please add a comment.

[Image: FFT of start jingle 1]

[Image: FFT of start jingle 2]

[Image: FFT of start jingle 3]

New update

In the end, I went with the algorithm described above. I tried to implement the Shazam algorithm, but I could not find reliable peaks in the spectrogram: the points it identified were not constant from one sound file to another. In theory, the Shazam algorithm is the solution to this kind of problem. The Hough-transform approach proposed by Dave Aaron Smith turned out to be more stable and effective. I have split about 400 files, and only 20 of them could not be split correctly. Disk usage went from 8 GB down to 1 GB.

Thank you for your help.

+4
4 answers

I wonder if you could use a Hough transform. You would begin by cataloging each step of the opening sequence. Let's say you use 10 ms steps and the opening sequence is 50 ms long. You compute some metric at each step and get

1 10 1 17 5 

Now go through the audio and compute the same metric for each 10 ms step. Call this array have_audio:

 8 10 8 7 5 1 10 1 17 6 2 10... 

Now create a new empty array of the same length as have_audio. Call it start_votes: it will hold the "votes" for where the opening sequence starts. If you see a 1, you may be at the first or third step of the opening sequence, so you cast 1 vote for the opening sequence starting 1 step back and 1 vote for it starting 3 steps back. If you see a 10, you cast 1 vote for the sequence starting 2 steps back; if you see a 17, 1 vote for 4 steps back, and so on.

So for this have_audio example, your votes will look like

 2 0 0 1 0 4 0 0 0 0 0 1 ... 

You have a lot of votes at position 6, so there is a good chance the opening sequence starts there.

You could also improve performance by not analyzing the entire opening sequence: if the opening sequence is 10 seconds long, you could search for just its first 5 seconds.
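
In the C# of the question, the voting could look roughly like this (an untested sketch: the tolerance for "the same metric" and all names are made up, and the exact vote counts can differ slightly from the hand-worked array above):

    using System;
    using System.Linq;

    static class JingleLocator
    {
        // If the audio metric at position pos matches the jingle metric at
        // step k, cast one vote for "the opening sequence starts at pos - k".
        public static int[] VoteForStart(double[] haveAudio, double[] jingleMetric,
                                         double tolerance)
        {
            var startVotes = new int[haveAudio.Length];
            for (int pos = 0; pos < haveAudio.Length; pos++)
                for (int step = 0; step < jingleMetric.Length; step++)
                {
                    int start = pos - step;   // candidate start position
                    if (start >= 0 &&
                        Math.Abs(haveAudio[pos] - jingleMetric[step]) <= tolerance)
                        startVotes[start]++;
                }
            return startVotes;
        }

        static void Main()
        {
            double[] jingle    = { 1, 10, 1, 17, 5 };
            double[] haveAudio = { 8, 10, 8, 7, 5, 1, 10, 1, 17, 6, 2, 10 };
            int[] votes = VoteForStart(haveAudio, jingle, 0.5);
            // Prints 5, i.e. position 6 in 1-based terms.
            Console.WriteLine(Array.IndexOf(votes, votes.Max()));
        }
    }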

+2

This paper describes the algorithm used by the Shazam service (which identifies music from a short, possibly noisy, sample): http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
From what I understand, the first step is to isolate the peaks in the spectrogram (with some tweaking to ensure uniform coverage), which gives you a "constellation" of (time, frequency) pairs from the original spectrogram. The sample's constellation is then compared with the full track's constellation by sliding a window of the sample's length from beginning to end and counting the number of matching points.
The paper then describes the technical solution that, according to the authors, lets this comparison run quickly even against a huge collection of tracks.
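
To illustrate just the window-sliding and point-counting step (not the pair hashing that makes the real service fast), here is a rough C# sketch; it assumes the peaks have already been extracted as (time, frequency) pairs:

    using System.Collections.Generic;
    using System.Linq;

    static class ConstellationMatcher
    {
        // recording and sample are constellations: (time step, frequency bin)
        // pairs of spectrogram peaks. Returns the time offset at which the
        // largest number of sample peaks coincide with recording peaks.
        public static int BestOffset(List<(int Time, int Freq)> recording,
                                     List<(int Time, int Freq)> sample)
        {
            var recordingSet = new HashSet<(int, int)>(recording);
            int maxTime = recording.Max(p => p.Time);
            int bestOffset = 0, bestCount = -1;

            for (int offset = 0; offset <= maxTime; offset++)
            {
                int count = sample.Count(p =>
                    recordingSet.Contains((p.Time + offset, p.Freq)));
                if (count > bestCount) { bestCount = count; bestOffset = offset; }
            }
            return bestOffset;
        }
    }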

+4

Here is a nice Python package that does just that:

https://code.google.com/p/py-astm/

If you are looking for specific algorithms, good search terms are "acoustic fingerprinting" or "perceptual hashing".

You can also try another Python package, here:

http://rudd-o.com/new-projects/python-audioprocessing/documentation/manuals/algorithms/butterscotch-signatures

+2

If you already know the jingle sequence, you can analyze the correlation with that sequence instead of computing the cross-correlation between entire 15-minute tracks.

To quickly calculate the correlation with a (short) sequence, I would suggest using a Wiener filter.

Edit: the Wiener filter is a way of detecting a known signal in a noisy sequence. In this application, we treat everything that is "not jingle" as noise (a question for the reader: can we assume the noise is white and uncorrelated?).

(I found the link I was looking for! The formulas I remembered were a bit off, so I have deleted them.)

The relevant page is Wiener deconvolution. The idea is that we can define a system whose impulse response h(t) has the same waveform as the jingle, and we must find the point in the noisy sequence at which the system received an impulse (i.e., emitted the jingle).

Since the jingle is known, we can calculate its power spectrum H(f), and since we can assume that a single jingle appears in the recorded sequence, we can say that the unknown input x(t) has the shape of an impulse, whose power density S(f) is constant at every frequency.

Given the above, you can use the formula on that page to obtain a "jingle-pass" filter (as in, a filter that lets only jingle-shaped signals through), whose output is highest when the jingle is being played.
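
For completeness, this is the filter from that page written in the notation above; since S(f) is constant it enters only as the ratio N(f)/S, and Y(f) below is the spectrum of the recorded 15 minutes:

    G(f)     = H*(f) S(f) / ( |H(f)|^2 S(f) + N(f) )
             = H*(f) / ( |H(f)|^2 + N(f)/S )          (since S(f) = S is constant)
    x_hat(t) = IFFT( G(f) Y(f) )

The deconvolved output x_hat(t) should spike at the instant the jingle begins, which gives you the start time directly.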

+1

Source: https://habr.com/ru/post/1347936/

