You should be aware that this is a very tricky problem, and if you don't have a background in signal processing (or an interest in learning about it), then you have a very frustrating time ahead of you. If you expect to throw a couple of FFTs at the problem, you won't get very far. I hope you do have the interest, as it is a really fascinating area.
First there is the problem of pitch recognition, which is easy enough to do for simple monophonic instruments (for example, voice) using a method such as autocorrelation or the harmonic sum spectrum (for example, see Paul R's link). However, you will often find that this gives the wrong results: you often get half or double the pitch that you were expecting. This is called pitch period doubling or octave errors, and it occurs essentially because the FFT or autocorrelation assumes that the data has constant characteristics over time. If you have an instrument played by a human, there will always be some variation.
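To make the autocorrelation idea concrete, here is a minimal sketch in plain Python. The function names, the 80-1000 Hz search range and the `tolerance` heuristic (take the first strong peak rather than the global maximum, to reduce the chance of landing on a multiple of the period) are my own assumptions for illustration, not a standard recipe.

```python
import math

def autocorrelation(samples, lag):
    """Mean product of the signal with a copy of itself delayed by `lag`."""
    n = len(samples) - lag
    return sum(samples[i] * samples[i + lag] for i in range(n)) / n

def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=1000.0, tolerance=0.9):
    """Return a rough fundamental frequency in Hz, or None if no peak is found.

    Assumption: the first local maximum whose height is within `tolerance`
    of the global maximum is taken as the pitch period.
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 2)
    lags = range(min_lag, max_lag + 1)
    scores = [autocorrelation(samples, lag) for lag in lags]
    if not scores or max(scores) <= 0:
        return None
    best = max(scores)
    for k in range(1, len(scores) - 1):
        is_peak = scores[k] > scores[k - 1] and scores[k] >= scores[k + 1]
        if is_peak and scores[k] >= tolerance * best:
            return sample_rate / lags[k]
    return None

# Synthetic test tone: a pure 220 Hz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 220.0 * t / sample_rate) for t in range(1024)]
pitch = estimate_pitch(tone, sample_rate)
```

Because the lag is a whole number of samples, the estimate is only approximate (a couple of Hz off here). Real recordings are much less well behaved than a synthetic sine, which is exactly where the octave errors come from.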
Some people approach the problem of key recognition by first doing pitch recognition and then finding the key from the sequence of pitches. This is incredibly difficult if you have anything other than a monophonic sequence of pitches. Even if you do have a monophonic sequence of pitches, there is still no clear-cut method for determining the key: how do you deal with chromatic notes, for example, or decide whether the key is major or minor? So you would need to use a method similar to Krumhansl's key-finding algorithm.
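As a rough illustration of the Krumhansl-style approach, the sketch below correlates a pitch-class histogram with 24 rotated key profiles and picks the best match. The profile numbers are the commonly quoted Krumhansl-Kessler probe-tone ratings; please check them against a primary source before relying on them. The function names and the note-count weighting (rather than note durations) are assumptions of mine.

```python
import math

# Commonly quoted Krumhansl-Kessler profiles, index 0 = tonic (verify before use).
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def find_key(pitch_classes):
    """Guess the key from a list of pitch classes (0 = C, 1 = C#, ..., 11 = B)."""
    histogram = [0.0] * 12
    for pc in pitch_classes:
        histogram[pc % 12] += 1  # assumption: every note counts equally
    best_score, best_key = None, None
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Rotate the profile so that its tonic sits on the candidate tonic.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            score = pearson(histogram, rotated)
            if best_score is None or score > best_score:
                best_score, best_key = score, NAMES[tonic] + " " + mode
    return best_key

# C major arpeggios followed by a C major scale.
melody = [0, 4, 7, 0, 4, 7, 0, 2, 4, 5, 7, 9, 11, 0]
key = find_key(melody)
```

Chromatic notes simply add small counts to the histogram, so they lower the correlation a little instead of breaking the method outright.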
So, given the complexity of this approach, an alternative is to look at all the notes being played at the same time. If you have chords, or more than one instrument, then you are going to have a rich spectral soup of many sinusoids playing at once. Each individual note is composed of multiple harmonics of a fundamental frequency, so A (at 440 Hz) will be composed of sinusoids at 440, 880, 1320, ... Hz. Furthermore, if you play an E (see the diagram for pitches), then that is 659.25 Hz, which is almost one and a half times the frequency of A (actually 1.498). This means that every third harmonic of A (very nearly) coincides with every second harmonic of E. This is the reason that chords sound pleasant: they share harmonics. (As an aside, the whole reason that Western harmony works is due to the quirk of fate that the twelfth root of 2 raised to the power 7 is nearly 1.5.)
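The arithmetic is easy to check for yourself. This is just a few lines of Python under the usual equal-temperament assumption (each semitone is a factor of the twelfth root of 2, with A tuned to 440 Hz); the variable names are mine.

```python
semitone = 2 ** (1 / 12)        # frequency ratio of one equal-tempered semitone

a_hz = 440.0
fifth_ratio = 2 ** (7 / 12)     # seven semitones up, roughly 1.4983
e_hz = a_hz * fifth_ratio       # roughly 659.26 Hz

third_harmonic_of_a = 3 * a_hz  # 1320 Hz
second_harmonic_of_e = 2 * e_hz # roughly 1318.5 Hz, about 1.5 Hz below

print(f"semitone ratio: {semitone:.4f}")
print(f"E above A440: {e_hz:.2f} Hz (ratio {fifth_ratio:.4f})")
print(f"3rd harmonic of A: {third_harmonic_of_a:.1f} Hz")
print(f"2nd harmonic of E: {second_harmonic_of_e:.1f} Hz")
```

The two harmonics do not line up exactly, which is why "coincides" really means "lands within a hertz or two".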
If you look beyond this interval of a fifth to major, minor and other chords, you will find other ratios. I think that many key-finding techniques enumerate these ratios and then fill a histogram for each spectral peak in the signal. So, when detecting the chord A5, you would expect to find peaks at 440, 880, 659, 1320, 1760 and 1977 Hz. For B5 it will be 494, 988, 741 Hz, and so on. So create a frequency histogram, and for every sinusoidal peak in the signal (for example, from the FFT power spectrum) increment the corresponding histogram entry. Then, for each key A-G, tally up the bins in your histogram that belong to it, and the key with the most entries is most likely your key.
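A toy version of the histogram idea might look like this. Everything here is an assumption made for the sake of a short example: candidate roots are the twelve semitones in the octave starting at A 440 Hz, each template holds the first four harmonics of the root and of its fifth, and a peak counts as a match if it is within 30 cents of a template frequency. Peak picking from the FFT is left out; the peaks are typed in by hand from the A5 example.

```python
import math

NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def template(root_hz, n_harmonics=4):
    """Expected peak frequencies for a root-plus-fifth chord (hypothetical template)."""
    fifth_hz = root_hz * 2 ** (7 / 12)
    return [h * f for f in (root_hz, fifth_hz) for h in range(1, n_harmonics + 1)]

def matches(peak_hz, target_hz, cents=30.0):
    """True if the peak lies within `cents` of the target frequency."""
    return abs(1200 * math.log2(peak_hz / target_hz)) <= cents

def vote(peaks_hz, a_hz=440.0):
    """Count, for every candidate root, how many peaks fit its template."""
    votes = {}
    for k, name in enumerate(NOTE_NAMES):
        expected = template(a_hz * 2 ** (k / 12))
        votes[name] = sum(
            1 for p in peaks_hz if any(matches(p, t) for t in expected)
        )
    return votes

# Peaks listed for the A5 chord in the text above.
peaks = [440, 880, 659, 1320, 1760, 1977]
votes = vote(peaks)
best = max(votes, key=votes.get)
```

A real system would cover several octaves, weight each peak by its power, and use fuller chord templates, but the counting step stays the same.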
That is only a very simple approach, but it may be enough to find the key of a strummed or sustained chord. You would also have to chop the signal into short intervals (for example, 20 ms) and analyse each one to build up a more robust estimate.
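The framing step could be sketched like this. The helper names are hypothetical, the frames do not overlap, and the per-frame results are combined with a simple majority vote, which is only one of several reasonable choices.

```python
from collections import Counter

def split_frames(samples, sample_rate, frame_ms=20.0):
    """Cut the signal into consecutive, non-overlapping frames of frame_ms."""
    size = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def majority(estimates):
    """Most common per-frame estimate, ignoring frames that gave no answer."""
    counts = Counter(e for e in estimates if e is not None)
    return counts.most_common(1)[0][0] if counts else None

# 1000 dummy samples at 8 kHz give six complete 160-sample frames.
frames = split_frames(list(range(1000)), 8000)

# Pretend these are the per-frame results of some key or chord detector.
overall = majority(["A", "A", "E", None, "A", "D"])
```

Bear in mind that frame length trades time resolution against frequency resolution: the FFT bin spacing is the sample rate divided by the frame length in samples, so a 20 ms frame gives bins 50 Hz apart.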
EDIT:
If you want to experiment, I'd suggest downloading a package like Octave or CLAM, which makes it easier to visualise audio data and to run FFTs and other operations.
Other useful links:
- My thesis on some aspects of key recognition: the maths is a bit heavy, but chapter 2 is (I hope) an accessible introduction to the different approaches to modelling musical audio.
- http://en.wikipedia.org/wiki/Auditory_scene_analysis - Bregman's Auditory Scene Analysis, which, although it doesn't talk about music, has some fascinating findings on how we perceive complex scenes.
- Dan Ellis has done some great work in this and similar areas.
- Keith Martin has some interesting approaches.