MemoryError in Python when using cKDTree (). Query_ball_tree

I have large 2D arrays with unsorted (X, Y) points, for which I need to know which points are in close proximity to each other (search for nearest neighbors). I used cKDTree and query_ball_tree with the results for arrays with 500,000 (X, Y) points. However, when I try to use the same algorithm for data sets of more than 1,000,000 points, query_ball_tree raises a MemoryError.

I am using 64-bit Windows with 16 GB of internal memory and trying to use the 32-bit and 64-bit versions of Python and the extension modules (scipy and numpy).

def Construct_SearchTree(AllXyPoints): KDsearch = cKDTree(AllXyPoints) return KDsearch.query_ball_tree(KDsearch, Maxdist) 

My questions:

1) Does anyone know of an alternative to cKDTree / query_ball_tree that consumes less memory? In this case, speed is less important than memory usage.

2) I was hoping that switching from 32-bit to 64-bit python and extensions would solve MemoryError. What could be the reason that this is not so?

Thanks for your help and help.

+4
source share
1 answer

I had a MemoryError with SciPy cKDTree at build time and scikit-learn KDTree when calling .query_radius() . I found that Scikit-learn BallTree had more memory, and using BallTree solved the problem for me. I tested BallTree with 1 million data points on my 64-bit system. It still consumes all available memory (12 GB) and some swap space, but I don't get a MemoryError .

BallTree will not be as fast as KDTree , as your data is 2D, and BallTree slower than KDTree when d <= 3 (see explanation here ). However, given that cKDTree and scikit-learn KDTree both raise MemorError (on my system, anyway), the easiest solution is to use BallTree .

 from sklearn.neighbors import BallTree import numpy as np max_dist = .1 points = np.random.normal(size=2000000).reshape(1000000, 2) #1 million points ball_tree = BallTree(points) neighbors = ball_tree.query_radius(points, max_dist) 

Depending on your Maxdist returned result may consume a lot of memory (up to O (n ^ 2)), but scikit-learn BallTree.query_radius() returns np.array from np.array and not list of list so that it np.array you some memory (see this answer for an explanation).

+3
source

Source: https://habr.com/ru/post/1495512/


All Articles