Selecting the k most mutually distant elements (clustering?)

I have a simple machine learning question:

I have n (~110) elements and a matrix of all pairwise distances. I would like to select 10 items that are far apart from one another. That is, I want to:

Maximize: choose 10 distinct elements and return the minimum distance over all pairs within those 10.

My distance metric is symmetrical and obeys the triangle inequality.

What algorithm can I use? My first instinct is this:

  • Cluster the n elements into 20 clusters.
  • Replace each cluster with the single element of that cluster that is farthest from the central element of the original n.
  • Brute-force over the remaining 20 candidates; fortunately, 20 choose 10 is only 184,756 (a rough sketch follows this list).
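A rough sketch of that pipeline in Python, assuming the distances sit in a symmetric numpy array dist of shape (n, n). The helper names are mine, the medoid of all n points stands in for the "central element", and AgglomerativeClustering with a precomputed metric (the parameter is called affinity in older scikit-learn releases) is just one way to cluster from a distance matrix:

    from itertools import combinations
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def min_pairwise(dist, subset):
        # The objective: the smallest pairwise distance within the chosen indices.
        return min(dist[i, j] for i, j in combinations(subset, 2))

    def pick_candidates(dist, n_clusters=20):
        # One representative per cluster: the member farthest from the medoid
        # of all n points (a stand-in for the "central element" above).
        labels = AgglomerativeClustering(
            n_clusters=n_clusters, metric="precomputed", linkage="average"
        ).fit_predict(dist)
        medoid = np.argmin(dist.sum(axis=1))
        reps = []
        for c in range(n_clusters):
            members = np.flatnonzero(labels == c)
            reps.append(members[np.argmax(dist[members, medoid])])
        return reps

    def cluster_then_brute_force(dist, k=10):
        # 20 choose 10 = 184,756 subsets, each scored with C(10, 2) = 45 lookups.
        candidates = pick_candidates(dist)
        return max(combinations(candidates, k), key=lambda s: min_pairwise(dist, s))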

Edit: thanks to an insightful comment, I changed "return the sum of distances" to "return the minimum distance" in the statement of the optimization problem.

2 answers

You can approach this combinatorial optimization problem by way of convex relaxation.

Let D be an upper triangular matrix with your distances on the upper triangle. That is, for i < j, D_ij is the distance between elements i and j. (Presumably you will also have zeros on the diagonal.)

Then your goal is to maximize x' * D * x, where x is a binary vector with 10 entries set to 1 and the rest set to 0. (Setting the i-th entry of x to 1 corresponds to choosing the i-th element as one of your 10 items.)
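In code terms (my own notation, assuming a symmetric numpy matrix dist with zero diagonal), the quantity being maximized looks like this; with a binary x it is the sum of pairwise distances among the chosen elements, i.e. the objective as originally stated:

    import numpy as np

    def sum_objective(dist, chosen):
        # x' * D * x with D upper triangular.
        D = np.triu(dist, k=1)        # keep distances above the diagonal, zero elsewhere
        x = np.zeros(len(dist))
        x[list(chosen)] = 1.0         # the i-th entry is 1 iff element i is one of the 10
        return x @ D @ x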

The "standard" convex-relaxation move for a combinatorial problem like this one is to relax the constraints so that x no longer needs to be discrete. This gives the following problem:

maximize y' * D * y subject to: 0 <= y_i <= 1 for all i, and 1' * y = 10

This is (morally) a quadratic program. (If we replace D with D + D', it becomes a bona fide quadratic program, and the optimum does not change.) You can use an off-the-shelf QP solver, or just plug it into the convex optimization framework of your choice (for example, cvx).

The y you get will not (and need not) be a binary vector, but there are several ways to round it to a discrete one. (The simplest is probably to set x to 1 at the 10 entries where y_i is largest, but you may need to do something more sophisticated.) In any case, y' * D * y for the y you obtain gives you an upper bound on the optimal value of x' * D * x, so if the x built from y has x' * D * x very close to y' * D * y, you can be quite happy with your approximation.
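Here is a minimal sketch of the relaxation plus the simple rounding, under a couple of assumptions of mine: dist is the full symmetric distance matrix (so y' * dist * y is just twice y' * D * y and has the same maximizer), and scipy's SLSQP is used instead of cvx, since the quadratic is not concave in general and a generic NLP solver only promises a local optimum:

    import numpy as np
    from scipy.optimize import minimize

    def relax_and_round(dist, k=10):
        n = len(dist)

        def objective(y):
            return -(y @ dist @ y)                 # negated so the minimizer maximizes y' * dist * y

        result = minimize(
            objective,
            np.full(n, k / n),                     # feasible interior starting point
            method="SLSQP",
            bounds=[(0.0, 1.0)] * n,               # 0 <= y_i <= 1
            constraints=[{"type": "eq", "fun": lambda y: y.sum() - k}],  # 1' * y = 10
        )
        x = np.zeros(n, dtype=bool)
        x[np.argsort(result.x)[-k:]] = True        # simplest rounding: keep the 10 largest y_i
        return x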

Let me know if any of this is unclear, notation-wise or otherwise.


Good question.

I am not sure this can be solved exactly, and your clustering solution seems reasonable. Another direction worth exploring is local search methods, such as simulated annealing and hill climbing.
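For what it's worth, here is a minimal hill-climbing sketch under the min-distance objective (dist is assumed to be a symmetric numpy array, and the helper names are mine; a simulated-annealing variant would also accept some worsening swaps instead of only improving ones):

    from itertools import combinations
    import random

    def min_pairwise(dist, subset):
        # Smallest pairwise distance within the chosen indices (the objective).
        return min(dist[i, j] for i, j in combinations(subset, 2))

    def hill_climb(dist, k=10, seed=0):
        n = len(dist)
        current = set(random.Random(seed).sample(range(n), k))
        while True:
            best_score, best_move = min_pairwise(dist, current), None
            # Examine every swap of one chosen point for one unchosen point.
            for out in current:
                for inn in range(n):
                    if inn in current:
                        continue
                    score = min_pairwise(dist, (current - {out}) | {inn})
                    if score > best_score:
                        best_score, best_move = score, (out, inn)
            if best_move is None:
                return sorted(current)             # local optimum: no swap improves
            current = (current - {best_move[0]}) | {best_move[1]}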

Here's the obvious baseline I would compare any other solution against (a sketch follows the list):

  • Repeat 100 times:
  • Greedily select the datapoint whose removal reduces the objective function the least, and remove it.
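A sketch of that baseline under the edited min-distance objective (dist again assumed to be a symmetric numpy array): starting from all n points, repeatedly drop the point whose removal leaves the largest minimum pairwise distance among the survivors, until 10 remain.

    import numpy as np

    def greedy_removal(dist, k=10):
        remaining = list(range(len(dist)))
        while len(remaining) > k:                  # with n = 110 and k = 10, 100 removals
            def score_without(p):
                # Objective of the remaining set with point p removed.
                rest = [q for q in remaining if q != p]
                sub = dist[np.ix_(rest, rest)]
                return sub[np.triu_indices_from(sub, k=1)].min()
            remaining.remove(max(remaining, key=score_without))
        return remaining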

Source: https://habr.com/ru/post/1344872/

