What is the best way to implement K-nearest neighbors in C # for a large number of dimensions?

I implement a classification algorithm for K-nearest neighbors in C # for a set for training and testing about 20,000 samples each and 25 measurements.

In my implementation, there are only two classes represented by "0" and "1". At the moment, I have the following simple implementation:

// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int[]> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = "+K);

    var testResults = new int[testSamples.Count()]; 

    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();
    // Declaring these here so that I don't have to 'new' them over and over again in the main loop, 
    // just to save some overhead
    var distances = new double[trainNumber][]; 
    for (var i = 0; i < trainNumber; i++)
    {
       distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index 
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;

        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)  
               yea++;
            else
               nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;

    }

    return testResults;
}

// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid i.e. same length 

    for (var i = 0; i < sample1.Count; i++)
    {   
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}

It takes quite a while to complete. It takes about 80 seconds on my system. How can I optimize this, ensuring that it also scales to more sample data? As you can see, I tried using PLINQ and parallel for loops, which helped (without them, it took about 120 seconds). What else can I do?

, KD- KNN , , .

qaru.site/questions/394106/... , , 3 , , - .

#, R C #, , , , , . , .

- PCA - . 25 .

+4
1

, , , , . . dotTrace profiler; Visual Studio . , , .

, :

  • . ? , , (. ), "Parallel.For".

  • , IList. "GetDistance()".

  • K ? "" K, K , /selection, , SortedSet , K.

+3

Source: https://habr.com/ru/post/1547464/


All Articles