Comparison of music data

I am looking for theory, algorithms, etc. for comparing music. In particular, I am studying how to detect duplicate music tracks that have different bitrates or possibly slightly different edits (a radio version versus an album version) but otherwise sound the same.

Use cases include services such as Grooveshark, YouTube, etc., which receive many duplicate uploads of the same track. I am also interested in textual comparison (e.g., how far two spellings of "Britney Spears" deviate from each other), although this is secondary, and I already have some sources to follow up on in that area.

I am mainly interested in codec-agnostic comparison methods and algorithms (i.e., working on a "raw" decoded stream), but codec-specific resources are also appreciated.

I know of projects like musicbrainz.org, but I haven't researched them further and would be interested to know whether such projects could help with this.


I wrote a similar answer here: Music recognition and signal processing.

In the research community, the problem of matching two signals up to environmental distortions (such as noise, or moderate variations in tempo, pitch, or bitrate) is known as audio (or acoustic) fingerprinting. The topic has been studied for at least a decade. This early (and often cited) paper by Haitsma and Kalker describes the problem clearly and proposes a simple solution.
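For a sense of what such a fingerprint looks like in practice, here is a minimal C# sketch in the spirit of the Haitsma-Kalker scheme. It assumes the per-frame energies of a set of frequency bands have already been extracted from the audio (the FFT/filterbank step is not shown), and the method and variable names are my own, not from the paper:

using System;

// Minimal sketch of a Haitsma-Kalker style fingerprint comparison.
// Input: per-frame band energies, e.g. 33 bands per frame computed elsewhere.
static class FingerprintSketch
{
    // Turn per-frame band energies into 32-bit sub-fingerprints.
    // Bit b of frame f is 1 if the energy difference between adjacent bands
    // increases from frame f-1 to frame f.
    public static uint[] ComputeSubFingerprints(double[][] bandEnergies)
    {
        int frames = bandEnergies.Length;
        if (frames < 2) throw new ArgumentException("need at least two frames");
        int bands = bandEnergies[0].Length;          // 33 bands give 32 bits per frame
        var fingerprint = new uint[frames - 1];
        for (int f = 1; f < frames; f++)
        {
            uint bits = 0;
            for (int b = 0; b < Math.Min(32, bands - 1); b++)
            {
                double diff = (bandEnergies[f][b] - bandEnergies[f][b + 1])
                            - (bandEnergies[f - 1][b] - bandEnergies[f - 1][b + 1]);
                if (diff > 0) bits |= 1u << b;
            }
            fingerprint[f - 1] = bits;
        }
        return fingerprint;
    }

    // Bit error rate between two equally long fingerprint blocks:
    // the fraction of differing bits. Low values indicate a likely match.
    public static double BitErrorRate(uint[] a, uint[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("blocks must have equal length");
        long wrongBits = 0;
        for (int i = 0; i < a.Length; i++)
        {
            uint x = a[i] ^ b[i];
            while (x != 0) { wrongBits += x & 1; x >>= 1; }   // count differing bits
        }
        return wrongBits / (double)(a.Length * 32);
    }
}

Two recordings are then compared by sliding a fingerprint block from one track along the fingerprint of the other and reporting the lowest bit error rate found; the paper discusses how to do this lookup efficiently with an index.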

The problem of finding musical similarity between two different versions of the same song is called cover song identification. This problem is also widely studied, but it is still considered open.

Perhaps the two most popular commercial content-based music search services are Midomi and Shazam.

I think this is relevant to your question. Check Google Scholar for recent work on these problems; the ISMIR proceedings are freely available online.


Regarding name comparisons, you can take a look at the Levenshtein distance algorithm. Given two strings, it computes a distance measure that can be used as a basis for catching duplicates.

I personally used it in a tool I developed for an application with a fairly large database that contained a large number of duplicates. Combined with some other comparisons specific to my domain, I was able to point the tool at the application database and quickly find many duplicate records. Not going to lie, it was great to see it in action.

It is also quick to implement; here is a C# version:

public int CalculateDistance(string s, string t)
{
    int n = s.Length; // length of s
    int m = t.Length; // length of t
    int[,] d = new int[n + 1, m + 1]; // distance matrix
    int cost;

    // Step 1: if either string is empty, the distance is the other's length
    if (n == 0) return m;
    if (m == 0) return n;

    // Step 2: initialize first row and column
    for (int i = 0; i <= n; d[i, 0] = i++) ;
    for (int j = 0; j <= m; d[0, j] = j++) ;

    // Step 3: fill the matrix row by row
    for (int i = 1; i <= n; i++)
    {
        // Step 4: iterate over the columns
        for (int j = 1; j <= m; j++)
        {
            // Step 5: substitution cost for the current character pair
            cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            // Step 6: take the minimum of deletion, insertion, and substitution
            d[i, j] = System.Math.Min(
                System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }

    // Step 7: the bottom-right cell holds the Levenshtein distance
    return d[n, m];
}
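As a hypothetical usage example (the titles and the 0.25 threshold are illustrative values, not something from the answer above), near-duplicate track titles could be flagged like this, assuming CalculateDistance is in scope:

// Normalize case first, then compare the edit distance to the longer length.
string a = "Britney Spears - Toxic (Radio Edit)";
string b = "Britney Spears - Toxic (radio edit)";
int distance = CalculateDistance(a.ToLowerInvariant(), b.ToLowerInvariant());
double normalized = (double)distance / System.Math.Max(a.Length, b.Length);
bool likelyDuplicate = normalized < 0.25; // illustrative threshold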

Source: https://habr.com/ru/post/1309568/

