Interpolate unstructured X, Y, Z data onto the best grid based on each point's nearest-neighbor distance

This question was edited after receiving answers, to show the final solution I used.

I have unstructured 2D datasets coming from different sources, for example: Example 1: 3D measurement samples; Example 2: 2D grid nodes. Each dataset consists of three numpy.ndarray objects (X, Y, and Z coordinates).

My ultimate goal is to interpolate this scattered data onto a grid for conversion to an image / matrix. Therefore, I need to find the "best grid" for the interpolation, and for that I need the best X and Y steps between the pixels of this grid.
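For the snippets below, assume the dataset is three flat arrays x, y, z. Here is a minimal, purely hypothetical stand-in (random points with a smooth z) so the steps can be run end to end, together with the imports the later code relies on:

    import numpy as np
    import scipy.interpolate
    import scipy.spatial

    # Hypothetical scattered dataset: random (x, y) points with a smooth z
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 10.0, 1000)
    y = rng.uniform(0.0, 5.0, 1000)
    z = np.sin(x) * np.cos(y)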

Determine the step based on the Euclidean distance between the points:

Use the average of the Euclidean distances between each point and its nearest neighbor.

  • Use KDTree / cKDTree from scipy.spatial to build a tree from the X, Y data.
  • Use the query method with k=2 to get the distances (with k=1 the distances are all zero, because each queried point finds itself at distance zero).
    # Generate KD-tree
    xy = np.c_[x, y]                           # X,Y data stacked for use with KDTree
    tree = scipy.spatial.cKDTree(xy)           # Build the KD-tree for the X,Y coordinates

    # Calculate the step
    distances, indices = tree.query(xy, k=2)   # Query distances for every X,Y point
    distances = distances[:, 1:]               # Drop the k=1 zero self-distances
    step = np.mean(distances)                  # Result

Performance tuning:

  • Use scipy.spatial.cKDTree, not scipy.spatial.KDTree, because it is significantly faster.
  • Use balanced_tree=False with scipy.spatial.cKDTree: a big speed-up in my case, but it may not be right for all data.
  • Use n_jobs=-1 with cKDTree.query to enable multithreading (the parameter is named workers in recent SciPy versions).
  • Use p=1 with cKDTree.query to use the Manhattan distance instead of the Euclidean distance (p=2): faster, but possibly less accurate.
  • Query distances only for a random subsample of the points: much faster on large datasets, but less accurate and less repeatable (a combined sketch follows this list).
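Put together, a sketch of the tuned query (the sample size of 1000 is a made-up value; recent SciPy versions name the thread count workers, older ones n_jobs, so adjust to your version):

    # Tuned variant: unbalanced tree, random subsample, Manhattan metric, all cores
    xy = np.c_[x, y]
    tree = scipy.spatial.cKDTree(xy, balanced_tree=False)      # faster build

    n_sample = min(1000, len(xy))                              # random subsample
    idx = np.random.default_rng(0).choice(len(xy), n_sample, replace=False)
    distances, _ = tree.query(xy[idx], k=2, p=1, workers=-1)   # Manhattan, multithreaded
    step = np.mean(distances[:, 1])                            # drop zero self-distances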

Interpolate points on the grid:

Interpolate the dataset points on the grid using the calculated step.

    # Generate grid
    def interval(axis):
        '''Return the numpy.linspace interval for the given axis.'''
        cent = axis.min() + np.ptp(axis) / 2        # Interval center
        nbs = int(np.ceil(np.ptp(axis) / step))     # Number of steps in the interval
        hwid = nbs * step / 2                       # Half interval width
        return np.linspace(cent - hwid, cent + hwid, nbs + 1)  # nbs steps -> nbs+1 points

    xg, yg = np.meshgrid(interval(x), interval(y))  # Generate the grid

    # Interpolate the X,Y,Z data on the grid
    zg = scipy.interpolate.griddata((x, y), z, (xg, yg))
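griddata defaults to linear interpolation; if that is too soft or too sharp for your data, method is the knob to try. A short sketch of the variants, using the same arrays as above:

    # Variants of the interpolation method (default is 'linear')
    zg_near = scipy.interpolate.griddata((x, y), z, (xg, yg), method='nearest')
    zg_cubic = scipy.interpolate.griddata((x, y), z, (xg, yg), method='cubic')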

Set NaN if the pixel is too far from the starting points:

Set NaN for grid pixels that are too far (distance > step) from the original X, Y, Z data points. The KDTree created earlier is reused.

    # Calculate pixel-to-data distances with the KDTree built earlier
    dist, _ = tree.query(np.c_[xg.ravel(), yg.ravel()])
    dist = dist.reshape(xg.shape)

    # Set NaN for pixels that are too far from any data point
    zg[dist > step] = np.nan
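Since the stated goal is conversion to an image / matrix, here is a minimal sketch (assuming matplotlib is available) to inspect the result; the NaN pixels are simply left blank:

    import matplotlib.pyplot as plt

    # zg is already a 2D matrix; render it as an image
    plt.imshow(zg, origin='lower', extent=(x.min(), x.max(), y.min(), y.max()))
    plt.colorbar(label='z')
    plt.show()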
2 answers

I suggest you go with KDTree.query.

You are looking for a characteristic distance of the scatter to scale your binning. I suggest you take only a random subset of your points and use the Manhattan distance, because KDTree.query is very slow (even though its complexity is O(n log n)).

Here is my code:

    import numpy
    import scipy.spatial

    # Create the tree
    tree = scipy.spatial.KDTree(numpy.array(points))  # better give it a copy?

    # Create a random subsample of the points
    n_repr = 1000
    shuffled_points = numpy.array(points)
    numpy.random.shuffle(shuffled_points)
    shuffled_points = shuffled_points[:n_repr]

    # Query the tree (k=2 puts the zero self-distance in column 0)
    dists, indices = tree.query(shuffled_points, k=2, p=1)

    # Get an _estimate_ of the average nearest-neighbour distance
    avg_dist = numpy.average(dists[:, 1])  # skip the zero self-distances
    print('average Manhattan distance to nearest neighbour is:', avg_dist)

I suggest using the Manhattan distance ( https://en.wikipedia.org/wiki/Taxicab_geometry ) because it is faster to compute than the Euclidean distance. And since you only need an estimate of the average distance, it should be sufficient.


The problem you want to solve is known as the all-nearest-neighbors problem. See this article, for example: http://link.springer.com/article/10.1007/BF02187718

I believe solutions to it are O(N log N), so of the same order as KDTree.query, but in practice much faster than a bunch of separate queries. Sorry, I don't know of a Python implementation.
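In practice, a single vectorized cKDTree.query over all points at once behaves like a batch all-nearest-neighbors query and avoids the per-call Python overhead; this is my own sketch, not the algorithm from the paper:

    import numpy as np
    import scipy.spatial

    # One batched query for all points instead of a Python loop of single queries
    pts = np.c_[x, y]
    tree = scipy.spatial.cKDTree(pts)
    dists, _ = tree.query(pts, k=2)    # column 0 is the zero self-distance
    nn_dists = dists[:, 1]             # nearest-neighbour distance for each point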

