Coast-constrained k-means

I am working on a data science project for my Data Science class, and we decided to tackle a problem related to desalination plants in California: "Where should we put k plants to minimize the distance to the zip codes?"

The data we have so far: zip, city, county, pop, lat, long, and the amount of water.

The problem is that I cannot find any resources on how to constrain the centroids so that they stay on the coast. So far we have been considering:

- Use the usual k-means algorithm, but move each centroid to the coast once the clusters have settled (bad).
- Use the usual weighted k-means algorithm, giving coastal zip codes infinite weight (I was told this is not a good solution either).
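For reference, idea 1 ("snap afterwards") can be sketched in a few lines. This is a minimal stdlib-only sketch, not code from the project: `kmeans` and `snap_to_coast` are hypothetical helpers, and `coast_points` is an assumed list of coastal candidate locations. The weakness is visible in the structure: the coast is consulted only after the means have settled, so the snapped sites can end up far from where a coast-aware optimization would place them.

```python
import math
import random

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means on (x, y) tuples."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Recompute each center as its cluster mean (keep old center if empty).
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

def snap_to_coast(centers, coast_points):
    """Idea 1: move each finished center to the nearest coastal candidate."""
    return [min(coast_points, key=lambda c: math.dist(c, ctr)) for ctr in centers]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
coast = [(0.0, 1.0), (5.0, 6.0)]
print(snap_to_coast(kmeans(pts, 2), coast))
```

On this toy data each inland cluster mean gets pulled to the nearest coastal candidate, but nothing guarantees the snapped pair is the pair of coastal sites that actually minimizes total distance.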

What do you guys think?

2 answers

I would approach this by establishing possible points that could be centers, i.e. your coastline.
I think this is close to Nathaniel Saul's first comment.
Thus, at each iteration, instead of taking the cluster mean, we select the point from the possible set that lies closest to the cluster.

I've simplified things to just two columns of data (lon. and lat.), but you should be able to extrapolate the concept. For simplicity of demonstration, I based this on the code here.

In this example, the purple dots are located on the coastline. If I understood correctly, the optimal shoreline locations should look something like this:

Coastline Optimum

See code below:

#! /usr/bin/python3.6
# Code based on:
# https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
import matplotlib.pyplot as plt
import numpy as np
import random

##### Simulation START #####
# Generate possible points (the "coastline").
def possible_points(n=20):
    y = list(np.linspace(-1, 1, n))
    x = [-1.2]
    X = []
    for i in list(range(1, n)):
        x.append(x[i-1] + random.uniform(-2/n, 2/n))
    for a, b in zip(x, y):
        X.append(np.array([a, b]))
    X = np.array(X)
    return X

# Generate sample data as k Gaussian blobs.
def init_board_gauss(N, k):
    n = float(N)/k
    X = []
    for i in range(k):
        c = (random.uniform(-1, 1), random.uniform(-1, 1))
        s = random.uniform(0.05, 0.5)
        x = []
        while len(x) < n:
            a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
            # Continue drawing points from the distribution in the range [-1,1]
            if abs(a) < 1 and abs(b) < 1:
                x.append([a, b])
        X.extend(x)
    X = np.array(X)[:N]
    return X
##### Simulation END #####

# Identify points for each center.
def cluster_points(X, mu):
    clusters = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x - mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t: t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

# Get the closest possible point for each cluster.
def closest_point(cluster, possiblePoints):
    closestPoints = []
    # Check the total distance for each candidate point.
    for possible in possiblePoints:
        distances = []
        for point in cluster:
            distances.append(np.linalg.norm(possible - point))
        closestPoints.append(np.sum(distances))  # minimize total distance
        # closestPoints.append(np.mean(distances))
    return possiblePoints[closestPoints.index(min(closestPoints))]

# Calculate new centers.
# Here the 'coast constraint' goes.
def reevaluate_centers(clusters, possiblePoints):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(closest_point(clusters[k], possiblePoints))
    return newmu

# Check whether the centers converged.
def has_converged(mu, oldmu):
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))

# Meta function that runs the steps of the process in sequence.
def find_centers(X, K, possiblePoints):
    # Initialize to K random centers
    oldmu = random.sample(list(possiblePoints), K)
    mu = random.sample(list(possiblePoints), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Re-evaluate centers
        mu = reevaluate_centers(clusters, possiblePoints)
    return (mu, clusters)

K = 3
X = init_board_gauss(30, K)
possiblePoints = possible_points()
results = find_centers(X, K, possiblePoints)

# Show results: constraints and clusters.
pointtypes1 = ["gx", "gD", "g*"]
plt.plot(np.matrix(possiblePoints).transpose()[0],
         np.matrix(possiblePoints).transpose()[1], 'm.')
for i in list(range(0, len(results[0]))):
    plt.plot(np.matrix(results[0][i]).transpose()[0],
             np.matrix(results[0][i]).transpose()[1], pointtypes1[i])
pointtypes = ["bx", "yD", "c*"]
# Show all cluster points.
for i in list(range(0, len(results[1]))):
    plt.plot(np.matrix(results[1][i]).transpose()[0],
             np.matrix(results[1][i]).transpose()[1], pointtypes[i])
plt.show()

Edited to minimize overall distance.


K-means does not minimize distances.

It minimizes squared errors, which is quite different. The difference is roughly that between the median and the mean in one-dimensional data. The error can be massive.

Here is a counterexample. Assume we have the coordinates:

 (-1,   0)
 (+1,   0)
 ( 0,  -1)
 ( 0, 101)

The center chosen by k-means will be (0, 25). The optimal location is (0, 0). The sum of distances from the k-means center is > 152, while the optimal location has a total distance of 104. So the centroid is almost 50% worse than the optimum! But the centroid (= multivariate mean) is what k-means uses!

k-means does not minimize Euclidean distances!

This is another way of saying that "k-means is sensitive to outliers".

It doesn't get better if you try to restrict it to placing "centers" only on the coast...

Also, you may at least want to use the Haversine distance, because in California 1 degree north != 1 degree east, since it is not at the equator.
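For illustration, a standard Haversine implementation (stdlib only; the mean Earth radius of 6371 km and the Los Angeles-area coordinates are assumptions of this sketch). At Southern California latitudes a degree of longitude is only about 92 km, while a degree of latitude is about 111 km:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

print(haversine_km(34, -118, 35, -118))  # ≈ 111.2 km per degree north
print(haversine_km(34, -118, 34, -117))  # ≈ 92.2 km per degree east
```

Treating lat/long as plane coordinates therefore stretches the map north-south by roughly 20% at this latitude, which is enough to move cluster boundaries.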

Furthermore, you probably should not assume that every location needs its own pipe; rather, they will be connected like a tree. This reduces the cost significantly.
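To get a feel for the saving, here is a minimal Prim's-algorithm sketch (stdlib only, Euclidean distances; `mst_length` is an illustrative helper, not part of either answer's code). For four points in a row, a star of individual pipes from the end point costs twice as much as a tree:

```python
import math

def mst_length(points):
    """Total edge length of a minimum spanning tree (Prim's algorithm)."""
    best = {i: math.dist(points[0], points[i]) for i in range(1, len(points))}
    total = 0.0
    while best:
        # Attach the cheapest point not yet in the tree.
        j = min(best, key=best.get)
        total += best.pop(j)
        # Update remaining distances through the newly attached point.
        for i in best:
            best[i] = min(best[i], math.dist(points[j], points[i]))
    return total

line = [(0, 0), (1, 0), (2, 0), (3, 0)]
star = sum(math.dist(line[0], p) for p in line[1:])
print(star, mst_length(line))  # 6.0 3.0
```

With a tree network, the objective "sum of distances from each zip code to its plant" no longer matches the real pipe cost, which is another reason to model the problem explicitly rather than trust k-means.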

I strongly suggest treating this as a generic optimization problem, not as k-means. K-means is an optimization too, but it may optimize the wrong function for your problem...
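As a sketch of what "generic optimization" could mean here: minimize the stated objective directly, i.e. the total distance from every demand point to its nearest chosen coastal site. The brute force below is only feasible for small candidate sets, since it tries all C(n, k) combinations; `best_k_sites` and the sample data are hypothetical. For realistic sizes, k-medoids restricted to coastal candidates, or a facility-location MILP, would be the standard tools.

```python
import itertools
import math

def best_k_sites(demand, candidates, k):
    """Exhaustively pick the k candidate sites that minimize the summed
    distance from every demand point to its nearest chosen site."""
    def cost(sites):
        return sum(min(math.dist(p, s) for s in sites) for p in demand)
    return min(itertools.combinations(candidates, k), key=cost)

demand = [(0, 0), (0, 1), (10, 0), (10, 1)]        # e.g. zip-code centroids
candidates = [(-1, 0.5), (-1, 3), (11, 0.5), (11, 5)]  # e.g. coastal sites
print(best_k_sites(demand, candidates, 2))
```

On this toy instance the two sites facing the demand clusters win, because the search optimizes the actual sum of distances rather than a sum of squared errors. Weighting each distance by population would be a one-line change in `cost`.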


Source: https://habr.com/ru/post/1268486/

