Compare each element in the array with each other

I need to compare a one-dimensional array, in which I need to compare each element of the array with every other element. The array contains a list of strings sorted from longest to shortest. No 2 elements in the array are equal, however there will be elements with the same length. I am currently doing N * (N + 1) / 2 comparisons (127.8 billion), and I'm trying to reduce the number of all comparisons.

I implemented a function that basically says: if the lines differ in length by more than x percent, then do not worry, they are not equal, and the other guys below it are not equal, or just break the loop and go to the next element.

I'm currently trying to reduce this by saying the following: if element A matches elements C and D, then it is reasonable that elements C and D will also match, so don't bother checking them (i.e. skip this operation). This, as far as I remember, since I do not currently know the data structure that will allow me to do this.

The question here is: Does anyone know such a data structure? or does anyone know how i can further reduce my comparisons?

My current implementation is designed for 3.5 days to complete in a window of 10 hours (i.e., too long), and my remaining options are either to reduce the execution time, which may or may not be possible, or to distribute the workload to dozens of systems, which may turn out to be impractical.

Update: . Replace the word equal to match. I calculate the Levenshtein distance

The idea is to find out if there are other rows in the array that exactly match each element of the array. The output is a database mapping of strings that were closely related.

Here is a partial method code. Before executing this code, there is code that loads items into the database.

    public static void RelatedAddressCompute() {
        TableWipe("RelatedAddress");

        decimal _requiredDistance = Properties.Settings.Default.LevenshteinDistance;

        SqlConnection _connection = new SqlConnection(Properties.Settings.Default.AML_STORE);
        _connection.Open();

        string _cacheFilter = "LevenshteinCache NOT IN ('','SAMEASABOVE','SAME')";

        SqlCommand _dataCommand = new SqlCommand(@"
        SELECT
            COUNT(DISTINCT LevenshteinCache)
        FROM
            Address
        WHERE 
            " + _cacheFilter + @"
        AND
            LEN(LevenshteinCache) > 12", _connection);
        _dataCommand.CommandTimeout = 0;
        int _addressCount = (int)_dataCommand.ExecuteScalar();

        _dataCommand = new SqlCommand(@"
        SELECT 
            Data.LevenshteinCache,
            Data.CacheCount
        FROM            
            (SELECT
                DISTINCT LevenshteinCache,
                COUNT(LevenshteinCache) AS CacheCount
            FROM
                Address
            WHERE 
                " + _cacheFilter + @"
            GROUP BY 
                LevenshteinCache) Data
        WHERE 
            LEN(LevenshteinCache) > 12
        ORDER BY 
            LEN(LevenshteinCache) DESC", _connection);
        _dataCommand.CommandTimeout = 0;
        SqlDataReader _addressReader = _dataCommand.ExecuteReader();

        string[] _addresses = new string[_addressCount + 1];
        int[] _addressInstance = new int[_addressCount + 1];
        int _itemIndex = 1;
        while (_addressReader.Read()) {
            string _address = (string)_addressReader[0];
            int _count = (int)_addressReader[1];
            _addresses[_itemIndex] = _address;
            _addressInstance[_itemIndex] = _count;
            _itemIndex++;
        }

        _addressReader.Close();
        decimal _comparasionsMade = 0;
        decimal _comparisionsAttempted = 0;
        decimal _comparisionsExpected = (decimal)_addressCount * ((decimal)_addressCount + 1) / 2;
        decimal _percentCompleted = 0;
        DateTime _startTime = DateTime.Now;
        Parallel.For(1, _addressCount, delegate(int i) {
            for (int _index = i + 1; _index <= _addressCount; _index++) {
                _comparisionsAttempted++;
                decimal _percent = _addresses[i].Length < _addresses[_index].Length ? (decimal)_addresses[i].Length / (decimal)_addresses[_index].Length : (decimal)_addresses[_index].Length / (decimal)_addresses[i].Length;
                if (_percent < _requiredDistance) {
                    decimal _difference = new Levenshtein().threasholdiLD(_addresses[i], _addresses[_index], 50);
                    _comparasionsMade++;
                    if (_difference <= _requiredDistance) {
                        InsertRelatedAddress(ref _connection, _addresses[i], _addresses[_index], _difference);
                    }
                }
                else {
                    _comparisionsAttempted += _addressCount - _index;
                    break;
                }
            }
            if (_addressInstance[i] > 1 && _addressInstance[i] < 31) {
                InsertRelatedAddress(ref _connection, _addresses[i], _addresses[i], 0);
            }
            _percentCompleted = (_comparisionsAttempted / _comparisionsExpected) * 100M;
            TimeSpan _estimatedDuration = new TimeSpan((long)((((decimal)(DateTime.Now - _startTime).Ticks) / _percentCompleted) * 100));
            TimeSpan _timeRemaining = _estimatedDuration - (DateTime.Now - _startTime);
            string _timeRemains = _timeRemaining.ToString();
        });
    }

InsertRelatedAddress is a function that updates the database and contains 500,000 items in the array.

+3
source share
2 answers

OK. , , . . , , . . , < n , - . , , n .

, : .

+1

. ? ? , , 2 , x , , . x ? , , . , - , .

0

Source: https://habr.com/ru/post/1780786/


All Articles