Python k-means algorithm

I am looking for a Python implementation of the k-means algorithm, with examples, for clustering and caching my database of coordinates.

+46
python algorithm cluster-analysis k-means
09 Oct '09 at 19:16
9 answers

SciPy's clustering implementations work well, and they include a k-means implementation.

There is also scipy-cluster, which does agglomerative clustering; this has the advantage that you don't need to decide on the number of clusters ahead of time. A minimal sketch of the k-means route is below.
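For illustration, here is roughly what calling SciPy's k-means (scipy.cluster.vq) might look like; the coords array is random stand-in data, not the asker's database:

import numpy as np
from scipy.cluster.vq import kmeans2, whiten

coords = np.random.rand(100, 2) * 10      # stand-in for your coordinate database
whitened = whiten(coords)                 # scale each column to unit variance
centroids, labels = kmeans2(whitened, 3)  # cluster into 3 groups
print(centroids)                          # cluster centres (in whitened space)
print(labels[:10])                        # cluster index assigned to each point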

+53
09 Oct '09 at 10:10

SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just ran into the same thing in version 0.7.1.

For now, I would recommend using PyCluster instead. Usage example:

>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean, 0.03 * numpy.diag([1,1]), 20)
...                        for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels  # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error  # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound  # Number of times this solution was found
1
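(A side note, if I read the PyCluster docs right: nfound is 1 above because kcluster does a single EM run by default; something like Pycluster.kcluster(points, 3, npass=10) would repeat the run with different random initializations and report how often the best solution was found.)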
+28
Feb 08

For continuous data, k-means is very easy.

You keep a list of your means, and for each data point you find the mean it is closest to and average the new data point into it. Your means will represent the recent salient clusters of points in the input data.

I do the averaging continuously, so there is no need to keep the old data around to obtain the new average. Given the old average k, the next data point x, and a constant n which is the number of past data points to keep the average of, the new average is

 k*(1-(1/n)) + x*(1/n) 

Here is the complete code in Python

from __future__ import division
from random import random

# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]

param = 0.01  # bigger numbers make the means change faster
              # must be between 0 and 1; it plays the role of 1/n in the formula above

for x in data:
    closest_k = 0
    smallest_error = 9999  # this should really be positive infinity
    for k in enumerate(means):
        error = abs(x - k[1])        # distance from this data point to mean k
        if error < smallest_error:
            smallest_error = error
            closest_k = k[0]
    # nudge the winning mean towards the new data point
    means[closest_k] = means[closest_k]*(1-param) + x*(param)

You could just print out the means once all the data has passed through, but it's much more fun to watch them change in real time. I used this on frequency envelopes of 20 ms bits of sound, and after talking to it for a minute or two it had consistent categories for the short vowel "a", the long vowel "o", and a consonant. Weird!

+19
Apr 09 '10 at 5:21

From Wikipedia, you could use SciPy's k-means clustering and vector quantization tools.

Or you could use a Python wrapper for OpenCV, ctypes-opencv.

Or you could use OpenCV's newer Python interface and its kmeans implementation.
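If you go that route, a rough sketch against the cv2 bindings might look like this (the sample data is invented here; cv2.kmeans wants float32 input):

import numpy as np
import cv2

points = np.random.rand(200, 2).astype(np.float32)   # made-up 2-D points
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
compactness, labels, centers = cv2.kmeans(points, 3, None, criteria,
                                          10, cv2.KMEANS_RANDOM_CENTERS)
print(centers)          # the 3 cluster centres
print(labels.ravel())   # cluster index for each point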

+5
Oct 09 '09 at 19:21

(Years later) The kmeans.py in is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means is simple and reasonably fast; it uses any of the 20-odd metrics in scipy.spatial.distance.
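To give a feel for how a scipy.spatial.distance metric plugs into the assignment step, here is a small sketch of the general idea (not the linked kmeans.py itself; the data and metric choice are arbitrary):

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.rand(100, 2)                          # made-up coordinate data
centres = X[np.random.choice(len(X), 3, replace=False)]
for _ in range(10):                                 # a few Lloyd iterations
    d = cdist(X, centres, metric='cityblock')       # swap in any supported metric
    labels = d.argmin(axis=1)                       # nearest centre per point
    # recompute centres; assumes no cluster goes empty in this toy run
    centres = np.array([X[labels == k].mean(axis=0) for k in range(3)])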

+5
Jul 04

You can also use GDAL, which has many functions for working with spatial data.
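If the coordinates live in a file GDAL/OGR can read, pulling them out for clustering might look roughly like this ('points.shp' is a hypothetical file; adapt to your actual source):

import numpy as np
from osgeo import ogr

ds = ogr.Open('points.shp')        # hypothetical point layer
layer = ds.GetLayer()
coords = np.array([(f.GetGeometryRef().GetX(), f.GetGeometryRef().GetY())
                   for f in layer])
# coords can now be fed to any of the k-means implementations above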

0
09 Oct '09 at 19:35

SciKit Learn's KMeans() is the simplest way to apply k-means clustering in Python. Fitting clusters is as simple as: kmeans = KMeans(n_clusters=2, random_state=0).fit(X).

This code snippet shows how to store the centroid coordinates and predict clusters for an array of coordinates.

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])

(courtesy of the SciKit Learn documentation linked above)
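Since the question also mentions caching: one simple approach (an assumption about what is wanted) is to persist the fitted estimator, or just its centres, and reload it instead of re-clustering every run, for example with joblib:

from joblib import dump, load
import numpy as np

dump(kmeans, 'kmeans_model.joblib')              # cache the fitted model
np.save('centers.npy', kmeans.cluster_centers_)  # or just the centres

restored = load('kmeans_model.joblib')           # later: reload and reuse
print(restored.predict([[0, 0], [4, 4]]))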

0
Feb 12 '17 at 12:45

Python's Pycluster and pyplot can be used for k-means clustering and for visualizing 2D data. A recent blog post, Analyzing stock prices/volumes using Python and PyCluster, gives an example of clustering stock data with PyCluster.
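A bare-bones version of that combination might look like this (a sketch with invented 2-D data, not the blog post's code):

import numpy as np
import Pycluster
import matplotlib.pyplot as plt

points = np.random.rand(150, 2)                     # made-up 2-D data
labels, error, nfound = Pycluster.kcluster(points, 3)
plt.scatter(points[:, 0], points[:, 1], c=labels)   # colour each point by cluster
plt.show()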

-1
Sep 14 '14 at 20:47

*This is k-means code in Python* (note that it relies on an external functions helper module that is not shown):

import math                      # the original "from math import math" is invalid; math.sqrt is used below
from functions import functions  # external helper module used by this answer (not shown)


class KMEANS:

    @staticmethod
    def KMeans(data, classterCount, globalCounter):
        counter = 0
        classes = []
        cluster = [[]]
        cluster_index = []
        tempClasses = []
        for i in range(0, classterCount):
            globalCounter += 1
            classes.append(cluster)
            cluster_index.append(cluster)
            tempClasses.append(cluster)
        classes2 = classes[:]
        # seed each cluster with one data point
        for i in range(0, len(classes)):
            globalCounter = 1
            cluster = [data[i]]
            classes[i] = cluster
        functions.ResetClasterIndex(cluster_index, classterCount, globalCounter)
        functions.ResetClasterIndex(classes2, classterCount, globalCounter)

        def clusterFills(classeses, globalCounter, counter):
            counter += 1
            combinedOfClasses = functions.CopyTo(classeses)
            functions.ResetClasterIndex(cluster_index, classterCount, globalCounter)
            functions.ResetClasterIndex(tempClasses, classterCount, globalCounter)
            avarage = []
            for k in range(0, len(combinedOfClasses)):
                globalCounter += 1
                avarage.append(functions.GetAvarage(combinedOfClasses[k]))
            for i in range(0, len(data)):
                globalCounter += 1
                minimum = 0
                index = 0
                for k in range(0, len(avarage)):
                    total = 0.0
                    for j in range(0, len(avarage[k])):
                        total += (avarage[k][j] - data[i][j]) ** 2
                    tempp = math.sqrt(total)        # Euclidean distance to centre k
                    if k == 0:
                        minimum = tempp
                    if tempp <= minimum:
                        minimum = tempp
                        index = k
                tempClasses[index].append(data[i])  # assign point to nearest centre
                cluster_index[index].append(i)
            if functions.CompareArray(tempClasses, combinedOfClasses) == 1:
                return clusterFills(tempClasses, globalCounter, counter)
            returnArray = []
            returnArray.append(tempClasses)
            returnArray.append(cluster_index)
            returnArray.append(avarage)
            returnArray.append(counter)
            return returnArray

        cdcd = clusterFills(classes, globalCounter, counter)
        if cdcd is not None:
            return cdcd

    @staticmethod
    def KMeansPer(data, classterCount, globalCounter):
        # cluster the first 30% of the data, then rerun on the full set from that result
        perData = data[0:int(float(len(data)) / 100 * 30)]
        result = KMEANS.KMeans(perData, classterCount, globalCounter)
        cluster_index = []
        tempClasses = []
        classes = []
        cluster = [[]]
        for i in range(0, classterCount):
            globalCounter += 1
            classes.append(cluster)
            cluster_index.append(cluster)
            tempClasses.append(cluster)
        classes2 = classes[:]
        for i in range(0, len(classes)):
            globalCounter = 1
            cluster = [data[i]]
            classes[i] = cluster
        functions.ResetClasterIndex(cluster_index, classterCount, globalCounter)
        functions.ResetClasterIndex(classes2, classterCount, globalCounter)
        counter = 0

        def clusterFills(classeses, globalCounter, counter):
            # identical to the helper defined in KMeans above
            counter += 1
            combinedOfClasses = functions.CopyTo(classeses)
            functions.ResetClasterIndex(cluster_index, classterCount, globalCounter)
            functions.ResetClasterIndex(tempClasses, classterCount, globalCounter)
            avarage = []
            for k in range(0, len(combinedOfClasses)):
                globalCounter += 1
                avarage.append(functions.GetAvarage(combinedOfClasses[k]))
            for i in range(0, len(data)):
                globalCounter += 1
                minimum = 0
                index = 0
                for k in range(0, len(avarage)):
                    total = 0.0
                    for j in range(0, len(avarage[k])):
                        total += (avarage[k][j] - data[i][j]) ** 2
                    tempp = math.sqrt(total)
                    if k == 0:
                        minimum = tempp
                    if tempp <= minimum:
                        minimum = tempp
                        index = k
                tempClasses[index].append(data[i])
                cluster_index[index].append(i)
            if functions.CompareArray(tempClasses, combinedOfClasses) == 1:
                return clusterFills(tempClasses, globalCounter, counter)
            returnArray = []
            returnArray.append(tempClasses)
            returnArray.append(cluster_index)
            returnArray.append(avarage)
            returnArray.append(counter)
            return returnArray

        cdcd = clusterFills(result[0], globalCounter, counter)
        if cdcd is not None:
            return cdcd
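The snippet imports a functions helper module that is not included in the answer; purely as a guess based on how they are called, those helpers might look something like this:

# Hypothetical stand-ins for the unshown 'functions' module; assumptions only.
class functions:
    @staticmethod
    def ResetClasterIndex(clusters, clusterCount, globalCounter):
        for i in range(clusterCount):           # empty out each cluster slot
            clusters[i] = []

    @staticmethod
    def CopyTo(clusters):
        return [list(c) for c in clusters]      # copy the cluster lists

    @staticmethod
    def GetAvarage(cluster):
        # element-wise mean of the points in one cluster
        return [sum(col) / float(len(cluster)) for col in zip(*cluster)]

    @staticmethod
    def CompareArray(a, b):
        return 1 if a != b else 0               # 1 means "not converged yet"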


-3
Mar 29 '16 at 10:44


