Algorithms for determining the musical key of a sound sample

I'm interested in determining the musical key of an audio sample. How can (or can't) an algorithm try to approximate the key of a musical sample?

Antares Autotune and Melodyne are two pieces of software that do such things.

Can someone clarify how this would work? That is, how would you mathematically derive a key from a song by analyzing the frequency spectrum for chord progressions, and so on?

This topic interests me very much!

Edit: brilliant sources and a lot of information can be found in the answers of everyone who contributed to this question.

Especially from: the_mandrill and Daniel Brückner.

+43
algorithm analysis sampling audio audio-processing
Jun 29 '10 at 14:58
8 answers

You should be aware that this is a very complex problem, and if you do not have a background in signal processing (or an interest in acquiring one), then you have a very frustrating time ahead of you. If you expect to throw a couple of FFTs at the problem, you will not get very far. I hope you do have some interest, because this really is a fascinating area.

First there is the problem of pitch detection, which is easy enough for simple monophonic sources (for example, voice), using a method such as autocorrelation or the harmonic sum spectrum (for example, see Paul R's link). However, you will often find that it gives the wrong results: you will often get half or double the pitch you expect. This is called pitch-period doubling, or octave error, and it arises essentially because the FFT or autocorrelation assumes the data has constant characteristics over time. With a human-performed instrument there will always be some variation.
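For example, a bare-bones autocorrelation pitch detector (a sketch assuming NumPy; the function name and search range are my own) shows the basic idea:

```python
import numpy as np

def autocorrelation_pitch(frame, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency of a monophonic frame
    by locating the strongest peak in its autocorrelation."""
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep only the non-negative lags.
    corr = np.correlate(frame, frame, mode="full")
    corr = corr[len(corr) // 2:]
    # Restrict the lag search to the plausible pitch range;
    # searching too wide is one source of octave errors.
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag = min_lag + np.argmax(corr[min_lag:max_lag])
    return sample_rate / best_lag

# A 220 Hz sine should be detected as roughly 220 Hz.
sr = 44100
t = np.arange(2048) / sr
freq = autocorrelation_pitch(np.sin(2 * np.pi * 220 * t), sr)
```

On real, time-varying signals the strongest peak can land at double or half the true period, which is exactly the octave-error problem described above.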

Some people approach key recognition by first performing pitch detection and then finding the key from the resulting sequence of pitches. This is incredibly difficult if you have anything other than a monophonic source. And even if you do have a monophonic sequence of pitches, there is still no clear-cut method for determining the key: how do you deal with chromatic notes, for example, or decide whether the key is major or minor? So you will need to use something like the Krumhansl key-finding algorithm.
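As an illustration of that last step (not the answerer's code; the profile values are the published Krumhansl-Kessler probe-tone ratings, and the input histogram is a made-up toy):

```python
import numpy as np

# Krumhansl-Kessler key profiles: perceived stability of each pitch
# class relative to the tonic, from probe-tone experiments.
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ["C", "C#", "D", "D#", "E", "F",
         "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(pc_histogram):
    """Correlate a 12-bin pitch-class histogram against every rotation
    of the major and minor profiles; the best correlation wins."""
    best = (-2.0, None)
    for tonic in range(12):
        for profile, mode in ((MAJOR, "major"), (MINOR, "minor")):
            r = np.corrcoef(pc_histogram, np.roll(profile, tonic))[0, 1]
            if r > best[0]:
                best = (r, f"{NAMES[tonic]} {mode}")
    return best[1]

# Toy input: note counts from a C major scale passage.
hist = np.array([5, 0, 3, 0, 3, 2, 0, 4, 0, 2, 0, 1], dtype=float)
```

Here `estimate_key(hist)` picks "C major" over close relatives like G major and A minor, which is the kind of disambiguation a raw note list cannot give you.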

So, given the complexity of that approach, an alternative is to look at all the notes playing at the same time. If you have chords or more than one instrument, you will have a rich spectral soup of many sinusoids playing simultaneously. Each individual note consists of several harmonics of the fundamental frequency, so an A (at 440 Hz) will consist of sine waves at 440, 880, 1320 Hz, and so on. Moreover, if you play an E (see any table of note frequencies), that is 659.25 Hz, which is almost exactly one and a half times the frequency of the A (actually 1.498). This means that every third harmonic of the A coincides with every second harmonic of the E. This is part of the reason chords sound pleasant: they share harmonics. (As an aside, the whole reason Western harmony works is down to the quirk of fate that the twelfth root of 2 raised to the 7th power is almost exactly 1.5.)
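The arithmetic behind that aside can be checked in a couple of lines:

```python
# An equal-tempered fifth is seven semitones, i.e. 2^(7/12),
# which lands very close to the just 3:2 ratio.
ratio = 2 ** (7 / 12)   # ~1.4983
a4 = 440.0
e5 = a4 * ratio         # ~659.26 Hz, the E above A440
```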

If you look beyond this fifth interval at major, minor, and other chords, you will find other such ratios. I think many key-finding methods enumerate these relationships and then fill in a histogram from the spectral peaks in the signal. So if you find an A5 chord, you would expect peaks at 440, 880, 659, 1320, 1760, 1977 Hz; for B5 it would be 494, 988, 741, and so on. So: build a frequency histogram, and for each sinusoidal peak in the signal (for example, from the FFT power spectrum) increment the corresponding histogram bin. Then, for each key from A to G, sum the bins belonging to that key; the one with the most entries is most likely your key.
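A stripped-down sketch of this histogram idea, assuming NumPy (note names, the scoring scheme, and the toy peak list are my own simplifications; ties between keys that contain all the observed pitch classes are resolved arbitrarily by candidate order):

```python
import numpy as np

NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D",
              "D#", "E", "F", "F#", "G", "G#"]
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}  # semitone offsets from the tonic

def pitch_class(freq_hz, ref=440.0):
    """Map a frequency to a pitch class (0 = A), rounding to the
    nearest equal-tempered semitone; harmonics fold onto the same
    class as their fundamental when they line up octave-wise."""
    semitones = 12 * np.log2(freq_hz / ref)
    return int(round(semitones)) % 12

def score_keys(peak_freqs, peak_powers):
    """Fill a 12-bin pitch-class histogram from spectral peaks, then
    score each candidate major key by the energy landing in its scale."""
    hist = np.zeros(12)
    for f, p in zip(peak_freqs, peak_powers):
        hist[pitch_class(f)] += p
    scores = {}
    for tonic in range(12):
        scores[NOTE_NAMES[tonic]] = sum(
            hist[(tonic + step) % 12] for step in MAJOR_SCALE)
    return max(scores, key=scores.get)

# Peaks of an A5 chord (A, E and harmonics), as in the text above.
freqs = [440, 880, 659.25, 1320, 1760, 1978]
best = score_keys(freqs, [1.0] * len(freqs))
```

With only one chord, several keys containing A, B, and E tie; the histogram becomes discriminative once peaks from a whole passage are accumulated.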

This is only a very naive approach, but it may be enough to find the key of a strummed or sustained chord. You will also have to chop the signal into short frames (for example, 20 ms) and analyse each one to build up a more reliable estimate.

EDIT:
If you want to experiment, I would suggest downloading a package such as Octave or CLAM, which makes it easy to visualise audio data and to perform FFTs and other operations.

Other useful links:

  • My thesis covers some aspects of key recognition; the maths gets a little hard, but chapter 2 is (I hope) an accessible introduction to various approaches to modelling musical audio.
  • http://en.wikipedia.org/wiki/Auditory_scene_analysis - Bregman's Auditory Scene Analysis, which, although not about music specifically, offers some fascinating insights into how we perceive complex auditory scenes
  • Dan Ellis has done some great work in this and similar areas.
  • Keith Martin has some interesting approaches.
+50
Jun 29 '10 at 16:39

I worked on the problem of transcribing polyphonic audio at university. The problem, as you have noted, is a serious one, and people have worked on it for decades: the first scientific articles on it date back to the 1940s, and to this day there is no reliable solution for the general case.

All the basic assumptions you usually read about are not entirely correct, and most are wrong enough to become unusable for anything except very simple scenarios.

Overtone frequencies are not exact multiples of the fundamental frequency: there are non-linear effects, so the higher partials drift away from their expected frequencies, and not just by a few Hertz; it is not unusual to find the 7th partial where you expected the 6th.

Fourier transforms do not play well with audio analysis, because the frequencies of interest are logarithmically spaced, while the Fourier transform produces linearly spaced frequencies. At low frequencies you need high frequency resolution to separate adjacent notes, but that comes at the cost of poor temporal resolution, and you lose the ability to separate individual notes played in quick succession.
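To put rough numbers on this trade-off (a back-of-the-envelope calculation of mine, not from the answer):

```python
sample_rate = 44100

def semitone_gap(freq_hz):
    """Distance in Hz between a note and the next semitone up."""
    return freq_hz * (2 ** (1 / 12) - 1)

# Adjacent semitones around A1 (55 Hz) are only ~3.3 Hz apart,
# while around A6 (1760 Hz) they are ~105 Hz apart.
low_gap = semitone_gap(55.0)
high_gap = semitone_gap(1760.0)

# An FFT's bin spacing is sample_rate / window_size, so merely
# separating the low pair needs a window of at least:
min_window = sample_rate / low_gap               # ~13500 samples
min_window_ms = 1000 * min_window / sample_rate  # ~300 ms
```

A 300 ms window easily spans several fast notes, which is exactly the temporal-resolution problem described above.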

An audio recording (probably) does not even contain all the information needed to recover the score. Much of our perception of music happens in our ears and brain. That is why some of the most successful systems are expert systems with large repositories of knowledge about the structure of (Western) music, which rely on signal processing only to extract low-level information from the recording.

When I get home, I will look through the papers I have read, pick the 20 or 30 most important ones, and add them here. I really suggest reading them before you decide to implement something: as noted above, most of the common assumptions are somewhat incorrect, and you do not want to rediscover things that have already been found and analysed over more than 50 years of research.

This is a difficult problem, but it is also a lot of fun. I would love to hear what you tried and how well it worked.




For now, you could take a look at the constant-Q transform, the cepstrum, and the Wigner(-Ville) distribution. There are also some good articles on how to extract frequencies from the phase shifts between successive short-time Fourier spectra; this allows very short window sizes (and hence high temporal resolution), since the frequency can be determined with an accuracy up to a thousand times finer than the frequency resolution of the underlying Fourier transform.

All of these transforms are much better suited to analysing musical audio than the plain Fourier transform. To improve the results of the basic transforms, look into the concept of energy reassignment.
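A minimal sketch of the phase-based frequency refinement mentioned above (assuming NumPy; the window and hop sizes, and the function name, are arbitrary choices of mine):

```python
import numpy as np

def refined_frequency(samples, sr, n_fft=1024, hop=256):
    """Refine the frequency of the strongest spectral peak using the
    phase advance between two overlapping FFT frames (the phase-vocoder
    trick). The coarse bin only localises to sr/n_fft ~ 43 Hz here; the
    phase difference recovers the frequency far more precisely."""
    win = np.hanning(n_fft)
    f1 = np.fft.rfft(win * samples[:n_fft])
    f2 = np.fft.rfft(win * samples[hop:hop + n_fft])
    k = np.argmax(np.abs(f2))                    # coarse peak bin
    # Measured minus expected phase advance, wrapped to [-pi, pi).
    expected = 2 * np.pi * k * hop / n_fft
    delta = np.angle(f2[k]) - np.angle(f1[k]) - expected
    delta = (delta + np.pi) % (2 * np.pi) - np.pi
    return (k / n_fft + delta / (2 * np.pi * hop)) * sr

sr = 44100
t = np.arange(4096) / sr
freq = refined_frequency(np.sin(2 * np.pi * 441.3 * t), sr)
```

With a 1024-point FFT the bin spacing is about 43 Hz, yet the phase refinement recovers 441.3 Hz to well under 1 Hz, which is the effect the answer describes.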

+16
Jun 29 '10 at 17:50

As far as I can tell from this article, each key has its own characteristic set of common notes, so the software probably analyses the sound sample to find the most common notes and chords. After all, several keys can share the same configuration of sharps and flats; the difference is which note the key starts on, and hence which chords matter, so looking at which notes and chords appear most often seems like the only real way to tell them apart. In fact, I do not think you can reduce this to a simple mathematical formula without losing a lot of information.

Please note that this comes from someone with absolutely no experience in this area, whose first exposure to it is the article linked in this answer.

+5
Jun 29 '10 at 15:16

You can use a Fourier transform to compute the frequency spectrum of a sound sample. From that output, you can use the known frequencies of specific notes to turn it into a list of the notes heard during the sample. Picking the strongest notes heard across a series of samples should give you a decent map of the various notes, which you can compare against different musical scales to get a list of possible scales containing that combination of notes.

To decide which specific scale is in use, take note (no pun intended) of the most frequently heard notes. In Western music the root of the scale is usually the most common note, followed by the fifth and then the fourth. You can also look for patterns such as common chords, arpeggios, or progressions.

Sample size is likely to be important here. Ideally, each sample will contain a single note (so you do not get two chords in one sample). If you filter out and concentrate on the low frequencies, you can use the loud "clicks" usually associated with percussion instruments to determine the tempo of the song and lock your algorithm onto the beat of the music. Start with samples that are double that length and adjust from there. Be prepared to throw away samples that do not contain much useful data (for example, a sample taken in the middle of a slide).
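A toy sketch of the click-based tempo idea, assuming NumPy (the frame size, threshold, and synthetic click train are arbitrary choices of mine, not from the answer):

```python
import numpy as np

def estimate_tempo(audio, sr, frame=512):
    """Very rough tempo estimate: compute frame-wise energy, flag
    frames whose energy jumps well above the mean (the 'clicks'),
    then convert the median gap between flagged frames to BPM."""
    n_frames = len(audio) // frame
    energy = np.array([np.sum(audio[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n_frames)])
    threshold = energy.mean() + 2 * energy.std()
    onsets = np.flatnonzero(energy > threshold)
    # Collapse runs of consecutive loud frames into single onsets.
    onsets = onsets[np.insert(np.diff(onsets) > 1, 0, True)]
    gaps = np.diff(onsets) * frame / sr   # seconds between clicks
    return 60.0 / np.median(gaps)

# Synthetic input: a unit-amplitude click every 0.5 s -> 120 BPM.
sr = 8192
audio = np.zeros(sr * 4)
for beat in range(8):
    audio[beat * sr // 2: beat * sr // 2 + 256] = 1.0
tempo = estimate_tempo(audio, sr)
```

Once the beat is known, the sample windows can be aligned to it so each window is more likely to contain a single note or chord.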

+5
Jun 29 '10 at 15:30

This is a complex topic, but a naive algorithm for identifying a single pitch (an isolated note) would look like this:

Perform a Fourier transform over, say, 4096 samples (the exact size depends on your resolution requirements) of the part of the recording that contains the note. Find the peak power in the spectrum: that is the note's frequency.
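A minimal NumPy sketch of that single-note case (the Hann window is my addition; a real signal would also benefit from interpolating around the peak):

```python
import numpy as np

sr = 44100
n = 4096
t = np.arange(n) / sr
# A clean A4 test tone stands in for the recorded note.
tone = np.sin(2 * np.pi * 440 * t)

# Window to reduce spectral leakage, then find the strongest bin.
spectrum = np.abs(np.fft.rfft(tone * np.hanning(n)))
peak_bin = np.argmax(spectrum)
peak_freq = peak_bin * sr / n   # bin index -> Hz
```

Note the granularity: with 4096 samples at 44.1 kHz the bins are about 10.8 Hz apart, so the peak lands near 440 Hz but not exactly on it.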

Things get messier if you have chords, multiple instruments or effects, or a non-monophonic musical texture.

+2
Jun 29 '10 at 15:21

First you need an algorithm for determining the pitch (for example, autocorrelation ).

You can then run the pitch algorithm over a number of short time windows to extract a pitch for each. After that, you need to see which musical key best fits the extracted pitches.

+1
Jun 29 '10 at 15:30

If you need a bunch of songs classified right now, then crowdsource the problem with something like Mechanical Turk.

+1
Jun 29 '10 at 21:06

Key analysis is not the same as pitch analysis. Unfortunately, the whole concept of key is somewhat ambiguous; different definitions generally have in common only the concept of the tonic, i.e. a central tone or chord. Even if there were a good system for automatic transcription, there would still be no reliable algorithm for determining the key.

+1
Jul 22 '15 at 18:03


