K-nearest-neighbor construction speed with scikit-learn and SciPy

I have a large set of two-dimensional points and I want to be able to quickly query the set for the k nearest neighbours of any point in 2-D space. Since the dimensionality is low, a KD-tree seems like a good way to do this. My original dataset will only be updated very rarely, so query time matters much more to me than build time. However, every time I run the program I will need to re-load the object, so I also need a structure that can be saved and re-loaded quickly.
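Concretely, the operation I need looks like the sketch below (the point set and query point are made up; I use scipy.spatial.cKDTree here only because it is the quickest way to demonstrate the build-once, query-often pattern):

```python
import numpy as np
from scipy.spatial import cKDTree

# a made-up 2-D point set standing in for my real data
rng = np.random.default_rng(0)
points = rng.uniform(0, 100, size=(10000, 2))

tree = cKDTree(points)                       # built once, rebuilt rarely
dist, idx = tree.query([50.0, 50.0], k=10)   # 10 nearest neighbours of one point

print(idx.shape, dist.shape)  # (10,) (10,)
```

For a single query point, `query` returns the distances (sorted ascending) and the indices of the k nearest points in the original array.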

Two options are available: the KDTree structures in SciPy and in scikit-learn. Below I profile the two of them for build speed and query speed over a wide range of list lengths. I also pickle the scikit-learn structure and show the time taken to re-load the object from the pickle. They are compared in a graph, and the code used to generate the timings is shown below.

As I will show in the graph, loading from the pickle is faster than building from scratch by half an order of magnitude at large N, showing that the KDTree is suitable for my use case (i.e. frequent re-loads, but infrequent re-builds).

Comparing build, re-load, and query times of the two KD-tree structures

Code for comparing build time:

    # Profiling the build time for the two KD-tree structures, and the time
    # to re-load the scikit-learn tree from a pickle
    import math, timeit, pickle
    import sklearn.neighbors
    from random import randint

    the_lengths = [100, 1000, 10000, 100000, 1000000]

    theSciPyBuildTime = []
    theSklBuildTime = []
    theRebuildTime = []

    for length in the_lengths:
        dim = 5 * int(math.sqrt(length))
        nTimes = 50

        listOfRandom2DPoints = [[randint(0, dim), randint(0, dim)]
                                for x in range(length)]

        # statements executed once before each timing run
        setup = ("import scipy.spatial\n"
                 "import sklearn.neighbors\n"
                 "from random import randint\n"
                 "length = " + str(length) + "\n"
                 "dim = " + str(dim) + "\n"
                 "listOfRandom2DPoints = [[randint(0, dim), randint(0, dim)]"
                 " for x in range(length)]")

        theSciPyBuildTime.append(
            timeit.timeit('scipy.spatial.KDTree(listOfRandom2DPoints, leafsize=20)',
                          setup=setup, number=nTimes) / nTimes)
        theSklBuildTime.append(
            timeit.timeit('sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20)',
                          setup=setup, number=nTimes) / nTimes)

        # pickle the scikit-learn tree in memory and time the re-load
        theTreeSkl = sklearn.neighbors.KDTree(listOfRandom2DPoints,
                                              leaf_size=20, metric='euclidean')
        temp = pickle.dumps(theTreeSkl)
        theRebuildTime.append(
            timeit.timeit('pickle.loads(temp)',
                          setup='from __main__ import pickle, temp',
                          number=nTimes) / nTimes)

Code for comparing query time:

    # Profiling the query time for the two KD-tree structures
    import math, timeit
    import scipy.spatial, sklearn.neighbors
    from random import randint

    the_lengths = [100, 1000, 10000, 100000, 1000000, 10000000]

    theSciPyQueryTime = []
    theSklQueryTime = []

    for length in the_lengths:
        dim = 5 * int(math.sqrt(length))
        nTimes = 50

        listOfRandom2DPoints = [[randint(0, dim), randint(0, dim)]
                                for x in range(length)]

        # a fresh random query point is drawn in each timing run's setup
        setup = ("from __main__ import sciPiTree, sklTree\n"
                 "from random import randint\n"
                 "randPoint = [randint(0, " + str(dim) + "),"
                 " randint(0, " + str(dim) + ")]")

        sciPiTree = scipy.spatial.KDTree(listOfRandom2DPoints, leafsize=20)
        sklTree = sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20)

        theSciPyQueryTime.append(
            timeit.timeit('sciPiTree.query(randPoint, 10)',
                          setup=setup, number=nTimes) / nTimes)
        # scikit-learn's KDTree.query expects a 2-D array of query points
        theSklQueryTime.append(
            timeit.timeit('sklTree.query([randPoint], 10)',
                          setup=setup, number=nTimes) / nTimes)

Questions:

  • Results: Although they seem to converge at very large N, scikit-learn appears to beat SciPy on both build time and query time. Do other people find the same?

  • Maths: Are there any better structures for this? I am working only in 2-D space (although the data will be quite dense, so brute force is out). Is there a better structure for low-dimensional kNN searches?

  • Speed: It looks as though the build times for the two approaches converge at large N, but my computer gave up on me. Can anyone verify this for larger N? And does the re-load time continue to grow roughly linearly?

  • Practicality: The SciPy KDTree won't pickle. As reported in this post, I get the error "PicklingError: Can't pickle <class 'scipy.spatial.kdtree.innernode'>: it's not found as scipy.spatial.kdtree.innernode". I think this is because it is a nested class. According to the answer given in this post, nested classes can be pickled with dill. However, dill gives me the same error. Why?
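On the pickling point, one workaround worth noting: in recent SciPy versions the C implementation, scipy.spatial.cKDTree, does support pickling. This is version-dependent, so treat the sketch below as an assumption to verify against your installed version; on versions where it fails, a fallback is to pickle only the raw coordinate array and rebuild the tree on load.

```python
import pickle
import numpy as np
from scipy.spatial import cKDTree

points = np.random.default_rng(1).uniform(0, 100, size=(1000, 2))

tree = cKDTree(points)
restored = pickle.loads(pickle.dumps(tree))  # round-trip through a pickle

# the restored tree answers queries identically to the original
d1, i1 = tree.query([10.0, 10.0], k=5)
d2, i2 = restored.query([10.0, 10.0], k=5)
```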

1 answer

I suggest trying the Gaussian mixture models from scikit-learn for this kind of problem. Since your data is 2-dimensional, the model should work well.
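For completeness, a minimal sketch of the suggestion above (the two-cluster toy data is invented; note that GaussianMixture models the density of the points rather than answering k-nearest-neighbour queries, so it only applies if cluster or density structure is what you are after):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# two well-separated synthetic 2-D clusters
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(8, 1, size=(200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)      # component assignment for each point
```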


Source: https://habr.com/ru/post/987888/

