Classification by distance

I have a 3-dimensional array of integers (~4000 x 6000 x 3) that I need to classify in a certain way. I was hoping for something like the k-means clustering method, but instead of specifying the number of clusters, I would like to specify the maximum radius of a cluster.

In other words, given spheres of a certain radius, I would like to find the minimum number of non-empty spheres that cover all the data, and classify the points accordingly.

Although I know little about this field, I have searched for clustering algorithms and have not found one that does this.

For example, given a random data set:

```python
>>> import numpy as np
>>> randomArray = np.random.rand(5, 5, 3) * 500   # small (5, 5, 3) example; the real data is ~4000 x 6000 x 3
>>> randomArray
array([[[ 256.68932025,  153.07151992,  196.19477623],
        [  48.05542231,  346.1289173 ,  327.44694932],
        [ 427.87340594,  197.26882283,  402.41558648],
        [ 192.50462233,  408.31800086,   81.66016443],
        [  64.15373494,   34.96971099,  446.55362458]],

       [[ 376.55003794,   70.09514697,  242.08947306],
        [ 194.86207829,  379.90969257,  439.47043484],
        [ 102.99922513,   98.57769429,  415.5059223 ],
        [ 464.65318671,  223.60061654,  417.52758666],
        [  53.68383153,  205.32517075,  299.83858164]],

       [[ 364.80957874,   14.26150931,  264.01568428],
        [ 295.75617954,  107.52678922,   87.89830525],
        [  57.90617865,  409.54132373,   54.36940482],
        [ 217.35951975,  345.7892723 ,  301.07031811],
        [ 295.98999071,   27.17772152,  182.58776469]],

       [[ 291.69513153,  466.03218019,  279.25794618],
        [ 179.60152364,  161.64966386,  269.34221176],
        [ 374.78609278,  259.18286321,  459.8037004 ],
        [ 458.51249648,   87.05600447,  268.12588338],
        [ 152.54500603,  472.36773898,    1.15894726]],

       [[  35.43731854,  163.42770568,  131.77683448],
        [  14.36039625,  390.33409364,  314.56443073],
        [  71.47211566,  109.78613335,  345.57021076],
        [  74.62340632,  328.51303903,  341.97344285],
        [ 205.02121677,  235.71812371,   32.91252756]]])
```

and a maximum radius

```python
R = 35
```

I would like the output to be a 2-D array with the same height and width, holding labels that indicate which sphere each point of the three-dimensional array belongs to. The Euclidean distance between any two points with the same label must not be greater than the maximum radius, no sphere should be empty, and the algorithm should use the minimum number of spheres needed to achieve this.
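Whatever algorithm is used, the bookkeeping is the same: flatten the (height, width, 3) array into a list of 3-D points, cluster, and reshape the labels back into a 2-D map. A minimal sketch of that round trip (`cluster_by_radius` is a placeholder name, not an existing function):

```python
import numpy as np

data = np.random.rand(10, 10, 3) * 500   # stand-in for the real (4000, 6000, 3) array
h, w, _ = data.shape

points = data.reshape(-1, 3)             # (h*w, 3): one 3-D point per pixel

# labels = cluster_by_radius(points, R=35)   # placeholder for the actual clustering step
labels = np.zeros(len(points), dtype=int)    # dummy labels so the sketch runs

labelMap = labels.reshape(h, w)          # 2-D label array with the same height and width
```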

2 answers

You can use a variation of Affinity Propagation clustering in scikit-learn (see the user guide, an example, and the cluster object). You would need to adjust the metric / distance matrix so that the radius of each sphere stays below the maximum radius. Affinity Propagation and density-based clustering (DBSCAN) are the only unsupervised machine-learning methods in scikit-learn that do not require a predetermined number of clusters.

(Figure: Affinity Propagation clustering example)

The reason I do not recommend DBSCAN is that it can leave points that are not part of any cluster; with Affinity Propagation, all points will be part of a cluster. Affinity Propagation can take a long time to converge, O(N^2 * T), where N is the number of samples and T is the number of iterations to convergence. The similarity "signal" between points in Affinity Propagation can represent your radius: since the default affinity, which determines how that signal between points is computed, is a variation of the Euclidean distance, you have a head start on computing the clusters.

affinity : string, optional, default='euclidean'

Which affinity to use. At the moment 'precomputed' and 'euclidean' are supported. 'euclidean' uses the negative squared Euclidean distance between points.

In addition, you can weight different points by passing a preference. As you iterate toward a converged number of clusters, you can inspect the affinity_matrix_ and cluster_centers_ attributes to check whether all points are within your maximum radius. The preference of each point can be changed as you iterate over different clustering options, to indicate which cluster it is most likely to be part of.

preference : array-like, shape (n_samples,) or float, optional

Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, i.e. of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.
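As a concrete starting point, here is a sketch of how the pieces above fit together with a precomputed affinity matrix. The choice of `preference=-R**2` is an assumption of mine, one heuristic way to tie the preference to the target radius rather than a documented rule, and the damping and iteration counts are likewise just reasonable defaults:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AffinityPropagation

R = 35
points = (np.random.rand(10, 10, 3) * 500).reshape(-1, 3)   # toy stand-in data

# Negative squared Euclidean distance: the same similarity the default
# affinity='euclidean' computes, precomputed here so it can be inspected or modified.
similarity = -cdist(points, points, 'sqeuclidean')

# preference=-R**2 is a heuristic (assumption), meant to favor a new exemplar
# once points sit farther than R from every existing one; convergence is not guaranteed.
ap = AffinityPropagation(affinity='precomputed', preference=-R**2,
                         damping=0.9, max_iter=1000)
labels = ap.fit_predict(similarity)

# Verify whether every point lies within R of its exemplar.
centers = points[ap.cluster_centers_indices_]
radii = np.linalg.norm(points - centers[labels], axis=1)
print(len(centers), 'clusters; max distance to exemplar:', radii.max(), '<= R:', (radii <= R).all())
```

If the check fails, raising the preference toward zero yields more exemplars and hence smaller clusters.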

If you want to try DBSCAN, the three main parameters are eps (the maximum distance between two samples), the minimum number of points needed to form a cluster, and the distance metric (see the DBSCAN cluster object, the user guide, and an example).

(Figure: DBSCAN clustering example)

You may not have to iterate as much with DBSCAN as with Affinity Propagation. You can quickly test whether the maximum distance you chose is viable, since a sample's cluster label in DBSCAN will be -1 if it is not part of any cluster at that distance. Memory consumption will be high with large data sets, though.
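A quick viability check along those lines might look like this (a sketch on toy data; `min_samples=2` is an arbitrary choice for the minimum cluster size):

```python
import numpy as np
from sklearn.cluster import DBSCAN

R = 35
points = (np.random.rand(10, 10, 3) * 500).reshape(-1, 3)   # toy stand-in data

# eps is the maximum distance between two samples for them to be neighbors.
db = DBSCAN(eps=R, min_samples=2, metric='euclidean').fit(points)

labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))      # samples not part of any cluster at this eps
print(n_clusters, 'clusters,', n_noise, 'noise points at eps =', R)
```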

There are three types of points / samples in the DBSCAN algorithm:

  • Core sample:

a sample in the dataset such that at least min_samples other samples exist within a distance of eps from it

  • Border sample:

a sample that is a neighbor of a core sample in a cluster but is not a core sample itself; it lies within eps of a core sample but does not itself have min_samples neighbors within eps

  • Outlier:

a sample that is not a core sample and is at least eps away from any core sample

DBSCAN clusters are defined only by core samples. This can be confusing, but the scikit-learn documentation explains the algorithm well.

I do not think there is an easy implementation for your problem. Both Affinity Propagation and DBSCAN have their pros and cons. I think Affinity Propagation would be your best bet initially, because DBSCAN can get bogged down and give you a memory error depending on the size of your sample matrix.

Hope this helps. This is a very interesting question; I would try the Affinity Propagation option first.


Potential approximate solution

I’m not sure this problem is tractable without additional restrictions, but if you are ready to live with an approximation, I think I can see a solution.

Consider each point in space as having a sphere of influence with R = 35. You can then compute an (upper-triangular) matrix of the distances between all points: if the distance from Pi to Pj is less than R, Pj falls in Pi's sphere of influence. Now each point has a set of influenced points. The second step is to go over these sets and choose which points will be included in the final solution. One method would be to select the most "influential" point, subtract its set of influenced points from the other sets, and repeat until everything is covered, as sketched below.
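A minimal sketch of that greedy procedure, assuming a full (rather than upper-triangular) distance matrix for simplicity, which limits it to modest point counts; `greedy_spheres` is a name invented here:

```python
import numpy as np
from scipy.spatial.distance import cdist

def greedy_spheres(points, R):
    """Greedily pick the point whose R-sphere of influence covers the most
    still-unlabelled points, until every point is covered."""
    n = len(points)
    within = cdist(points, points) <= R     # within[i, j]: point j is in point i's sphere
    labels = np.full(n, -1)
    uncovered = np.ones(n, dtype=bool)
    label = 0
    while uncovered.any():
        gain = (within & uncovered).sum(axis=1)   # uncovered points each sphere would absorb
        best = int(np.argmax(gain))               # the most "influential" remaining point
        members = within[best] & uncovered
        labels[members] = label                   # assign them to this sphere
        uncovered &= ~members                     # subtract the set from further consideration
        label += 1
    return labels

points = (np.random.rand(10, 10, 3) * 500).reshape(-1, 3)   # toy stand-in data
labelMap = greedy_spheres(points, R=35).reshape(10, 10)     # back to a 2-D label map
```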

There are several problems with this approach that may or may not be important ...

  • A sphere could cover more points if its center is not placed exactly at one of your data points (picture two points that are 30 units apart). If your data is fairly well distributed or clustered, this can still give acceptable results.
  • I don’t think that iteratively choosing the sphere with the most "influenced" points is guaranteed to give an optimal solution. (Draw a dense cluster of points with one stray point just to its right and one just to its left: this method will give three spheres, whereas the best solution may cover all points with only two.)

If the approximation seems like something that might work for your project, we can discuss the specifics of coding a realistic test and try it out.


Source: https://habr.com/ru/post/1274645/

