For the name comparisons, take a look at the Levenshtein distance algorithm. Given two strings, it calculates the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. That distance can be used as a basis for catching duplicates.
I personally used it in a tool I developed for an application with a fairly large database that contained a lot of duplicates. Combining it with some other data mappings specific to my domain, I was able to point the tool at the application database and quickly find many duplicate records. Not going to lie, it was great to watch in action.
It is also quick to implement. Here is a C# version:
    public int CalculateDistance(string s, string t)
    {
        int n = s.Length; // length of s
        int m = t.Length; // length of t
        int[,] d = new int[n + 1, m + 1]; // matrix
        int cost; // cost

        // Step 1: if either string is empty, the distance is the other's length
        if (n == 0) return m;
        if (m == 0) return n;

        // Step 2: fill the first column and the first row
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            // Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5: cost is 0 when the characters match, 1 otherwise
                cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6: cheapest of deletion, insertion, substitution
                d[i, j] = System.Math.Min(
                    System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        // Step 7
        return d[n, m];
    }
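A raw edit count is hard to compare across names of different lengths, so for duplicate hunting it can help to scale it to a similarity between 0 and 1 and flag pairs above a cutoff. The following is only a rough sketch of that idea, not the tool described above: the `DuplicateFinder` class, the `Similarity` and `FindCandidates` helpers, the trim/lowercase normalization, and the threshold are all assumptions made for illustration. To keep the block self-contained it uses a two-row variant of the distance calculation, which should give the same result as the full matrix while keeping only two rows in memory.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateFinder
{
    // Two-row Levenshtein distance: only the previous and current
    // rows of the matrix are kept, instead of the whole matrix.
    public static int Distance(string s, string t)
    {
        if (s.Length == 0) return t.Length;
        if (t.Length == 0) return s.Length;

        int[] previous = new int[t.Length + 1];
        int[] current = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // The row just computed becomes the previous row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[t.Length];
    }

    // Hypothetical helper: similarity in [0, 1] after trimming and
    // lowercasing (an assumed normalization; adapt to your data).
    public static double Similarity(string a, string b)
    {
        string x = a.Trim().ToLowerInvariant();
        string y = b.Trim().ToLowerInvariant();
        int longest = Math.Max(x.Length, y.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)Distance(x, y) / longest;
    }

    // Hypothetical helper: every pair of names whose similarity
    // reaches the threshold. Compares all pairs, so it is O(n^2).
    public static List<Tuple<string, string>> FindCandidates(
        IList<string> names, double threshold)
    {
        var pairs = new List<Tuple<string, string>>();
        for (int i = 0; i < names.Count; i++)
        {
            for (int j = i + 1; j < names.Count; j++)
            {
                if (Similarity(names[i], names[j]) >= threshold)
                {
                    pairs.Add(Tuple.Create(names[i], names[j]));
                }
            }
        }
        return pairs;
    }
}
```

For example, "Jon Smith" and "John Smith" differ by one inserted character over a longest length of 10, so they score about 0.9 and would be flagged with a threshold of 0.85. The all-pairs loop gets slow on a large table, so in practice you would probably narrow the candidates first (for instance by grouping on some other field) before comparing names.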