I need to compare a one-dimensional array, in which I need to compare each element of the array with every other element. The array contains a list of strings sorted from longest to shortest. No 2 elements in the array are equal, however there will be elements with the same length. I am currently doing N * (N + 1) / 2 comparisons (127.8 billion), and I'm trying to reduce the number of all comparisons.
I implemented a function that basically says: if the lines differ in length by more than x percent, then do not worry, they are not equal, and the other guys below it are not equal, or just break the loop and go to the next element.
I'm currently trying to reduce this by saying the following: if element A matches elements C and D, then it is reasonable that elements C and D will also match, so don't bother checking them (i.e. skip this operation). This, as far as I remember, since I do not currently know the data structure that will allow me to do this.
The question here is: Does anyone know such a data structure? or does anyone know how i can further reduce my comparisons?
My current implementation is designed for 3.5 days to complete in a window of 10 hours (i.e., too long), and my remaining options are either to reduce the execution time, which may or may not be possible, or to distribute the workload to dozens of systems, which may turn out to be impractical.
Update: . Replace the word equal to match. I calculate the Levenshtein distance
The idea is to find out if there are other rows in the array that exactly match each element of the array. The output is a database mapping of strings that were closely related.
Here is a partial method code. Before executing this code, there is code that loads items into the database.
public static void RelatedAddressCompute() {
TableWipe("RelatedAddress");
decimal _requiredDistance = Properties.Settings.Default.LevenshteinDistance;
SqlConnection _connection = new SqlConnection(Properties.Settings.Default.AML_STORE);
_connection.Open();
string _cacheFilter = "LevenshteinCache NOT IN ('','SAMEASABOVE','SAME')";
SqlCommand _dataCommand = new SqlCommand(@"
SELECT
COUNT(DISTINCT LevenshteinCache)
FROM
Address
WHERE
" + _cacheFilter + @"
AND
LEN(LevenshteinCache) > 12", _connection);
_dataCommand.CommandTimeout = 0;
int _addressCount = (int)_dataCommand.ExecuteScalar();
_dataCommand = new SqlCommand(@"
SELECT
Data.LevenshteinCache,
Data.CacheCount
FROM
(SELECT
DISTINCT LevenshteinCache,
COUNT(LevenshteinCache) AS CacheCount
FROM
Address
WHERE
" + _cacheFilter + @"
GROUP BY
LevenshteinCache) Data
WHERE
LEN(LevenshteinCache) > 12
ORDER BY
LEN(LevenshteinCache) DESC", _connection);
_dataCommand.CommandTimeout = 0;
SqlDataReader _addressReader = _dataCommand.ExecuteReader();
string[] _addresses = new string[_addressCount + 1];
int[] _addressInstance = new int[_addressCount + 1];
int _itemIndex = 1;
while (_addressReader.Read()) {
string _address = (string)_addressReader[0];
int _count = (int)_addressReader[1];
_addresses[_itemIndex] = _address;
_addressInstance[_itemIndex] = _count;
_itemIndex++;
}
_addressReader.Close();
decimal _comparasionsMade = 0;
decimal _comparisionsAttempted = 0;
decimal _comparisionsExpected = (decimal)_addressCount * ((decimal)_addressCount + 1) / 2;
decimal _percentCompleted = 0;
DateTime _startTime = DateTime.Now;
Parallel.For(1, _addressCount, delegate(int i) {
for (int _index = i + 1; _index <= _addressCount; _index++) {
_comparisionsAttempted++;
decimal _percent = _addresses[i].Length < _addresses[_index].Length ? (decimal)_addresses[i].Length / (decimal)_addresses[_index].Length : (decimal)_addresses[_index].Length / (decimal)_addresses[i].Length;
if (_percent < _requiredDistance) {
decimal _difference = new Levenshtein().threasholdiLD(_addresses[i], _addresses[_index], 50);
_comparasionsMade++;
if (_difference <= _requiredDistance) {
InsertRelatedAddress(ref _connection, _addresses[i], _addresses[_index], _difference);
}
}
else {
_comparisionsAttempted += _addressCount - _index;
break;
}
}
if (_addressInstance[i] > 1 && _addressInstance[i] < 31) {
InsertRelatedAddress(ref _connection, _addresses[i], _addresses[i], 0);
}
_percentCompleted = (_comparisionsAttempted / _comparisionsExpected) * 100M;
TimeSpan _estimatedDuration = new TimeSpan((long)((((decimal)(DateTime.Now - _startTime).Ticks) / _percentCompleted) * 100));
TimeSpan _timeRemaining = _estimatedDuration - (DateTime.Now - _startTime);
string _timeRemains = _timeRemaining.ToString();
});
}
InsertRelatedAddress is a function that updates the database and contains 500,000 items in the array.