Limited Memory Clustering

Question

I am developing an application in App Engine and using kmeans2 from SciPy.

When I run the clustering, I get this error:

Exceeded soft private memory limit with 159.852 MB after servicing 1 requests total

Here is what I'm doing; color_data will be about 5 million x, y, z points:

    from operator import itemgetter

    from numpy import array
    from scipy.cluster.vq import kmeans2


    def _cluster(color_data, k):
        """
        Clusters colors and return top k

        Arguments:
        ----------
            color_data
                TYPE: list
                DESC: The pixel rgb values to cluster
            k
                TYPE: int
                DESC: number of clusters to find in the colors

        Returns:
        --------
            sorted_colors
                TYPE: list
                DESC: A list of rgb centroids for each color cluster
        """
        # make rgbs into x,y,z points
        x,y,z = [],[],[]
        for color in color_data:
            x.append(color[0])
            y.append(color[1])
            z.append(color[2])

        # averaged_colors are points at center of color clusters
        # labels are cluster numbers for each point
        averaged_colors, labels = kmeans2(array(zip(x,y,z)), k, iter=10)

        # get count of nodes per cluster
        frequencies = {}
        for i in range(k):
            frequencies[i] = labels.tolist().count(i)

        # sort labels on frequency
        sorted_labels = sorted(frequencies.iteritems(), key=itemgetter(1))

        # sort colors on label they belong to
        sorted_colors = []
        for l in sorted_labels:
            sorted_colors.append(tuple(averaged_colors[l[0]].tolist()))

        return sorted_colors

How can I do this within the 128 MB memory limit?

EDIT: When I run the application on my local computer, Activity Monitor shows about 500 MB of memory in use.

Tags: python google-app-engine scipy cluster-analysis k-means

Michael johnston Jul 17 '13 at 1:35

2 answers

Anony-mousse · Answer 1 · 2013-07-17T08:01:25+0000

Do not use all of the pixels.

K-means usually returns almost identical results if you use only 10% of the pixels, or even fewer. It computes means, and a mean barely changes when you add more data points, unless the additional data is distributed differently.

Using only 10% of the pixels should make your application use much less memory.
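A minimal sketch of that subsampling step, using only the standard library. The helper name `sample_pixels` and its `fraction` and `seed` parameters are hypothetical and not part of the question's code or of SciPy; it assumes `color_data` is a list of (r, g, b) tuples, as in the question. Note that it only shrinks what is handed to the clustering step; the full pixel list still has to be in memory when the sample is drawn.

```python
import random


def sample_pixels(color_data, fraction=0.1, seed=None):
    """Return a random subset holding about `fraction` of the pixels.

    Hypothetical helper for illustration; assumes color_data is a list
    of (r, g, b) tuples.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    # Keep at least one pixel when there is any input, and never ask for
    # more pixels than exist (an empty input yields an empty sample).
    sample_size = min(len(color_data), max(1, int(len(color_data) * fraction)))
    # A dedicated Random instance makes the draw reproducible when seed is set.
    rng = random.Random(seed)
    # random.sample draws without replacement, so no pixel is picked twice.
    return rng.sample(color_data, sample_size)
```

The sample could then be passed to the function from the question, for example `_cluster(sample_pixels(color_data), k)`. Whether 10% is small enough to stay under the memory limit depends on the image and would need to be measured.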

sarwar · Answer 2 · 2013-07-17T03:26:57+0000

If you cannot reduce the memory usage of your processing, you should look into increasing the memory allocated to your application, or switch to another provider. A simple Rackspace server costs about $20/month, though by definition it is closer to the metal and requires more setup.
