Select a subset of the farthest points in a given set of points

Imagine you are given a set S of n points in three dimensions. The distance between any two points is a simple Euclidean distance. You want to select a subset of Q from k points from this set so that they are farthest from each other. In other words, there is no other subset Q of k points such that min of all pairwise distances in Q is less than in Q.

If n is about 16 million and k is about 300, how do we effectively do this?

I guess this NP is hard, maybe we just want to focus on the approximation. One idea that I can come up with is to use multidimensional scaling to sort these points in a row, and then use the binary search version to get the points farthest from each other on this line.

+5
source share
2 answers

I am also sure that the problem is NP-Hard, the most similar problem I found is k-Center Problem . If runtime is more important than correctness, a greedy algorithm is probably your best bet:

Q ={} while |Q| < k Q += p from S where mindist(p, Q) is maximal 

Note: for similar problems, for example, a problem with the cover , you can show that the solution from the greedy algorithm is at least 63% and the optimal solution.

To speed things up, I see 3 possibilities:

  • Index your dataset in R-Tree first , then do a greedy search. Building an R-tree is O (n log n), but although it is designed to find the closest neighbor, it can also help you find the farthest point for the set of points in O (log n). It can be faster than the naive O (k * n) algorithm.

  • Choose a subset of your 16 million points and execute the greedy algorithm on the subset. You get close anyway so you can save a little more accuracy. You can also combine this with algorithm 1.

  • Use an iterative approach and stop when you don't have time. The idea here is to randomly select k points from S (allows us to call this set Q '). Then at each step you switch the point p_ from Q ', which has a minimum distance to another in Q' with a random point from S. If the resulting set Q '' is better to continue Q '', otherwise repeat with Q ', so as not to get stuck , you can choose a different point from Q 'than p_ if you could not find an adequate replacement for several iterations.

+3
source

If you can afford to do ~ k * n distance calculations, you could

  • Find the center of the distribution of points.
  • Select the point farthest from the center. (and remove it from the set of points not selected).
  • Find the point that is far from all the currently selected points, and select it.
  • Repeat 3. until you're done with k dots.
0
source

Source: https://habr.com/ru/post/1275510/


All Articles