Numpy - Clustering - Distance - Vectorization

I clustered a sample of data (400k samples, dimension = 205) into 200 clusters using sklearn KMeans.

For each cluster, I want to know the maximum distance between the cluster center and the most distant sample of that cluster, in order to learn something about the "size" of the cluster. Here is my code:

import numpy as np
import scipy.spatial.distance as spd

diam = np.empty(200)
for i in range(200):
    # distance from center i to every sample assigned to cluster i; keep the maximum
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0] == i][:, 1:]).max()

"seeds" are the centers of the clusters (200x206). The first “seed” column contains the number of samples within the cluster (not relevant here).

"data" are samples (400kx206). The first data column contains the cluster number.

Question: this is done with a loop (not very "numpy"). Is it possible to "vectorize" it?

3 answers

We can be a little smarter about the indexing and get roughly a 4x speedup.

First, create some data of the correct form:

seed = np.random.randint(0, 100, (200, 206))
data = np.random.randint(0, 100, (400000, 206))
seed[:, 0] = np.arange(200)
data[:, 0] = np.random.randint(0, 200, 400000)
diam = np.empty(200)

Timing of the original approach:

%%timeit
for i in range(200):
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0]==i][:,1:]).max()

1 loops, best of 3: 1.35 s per loop

moarningsun's answer:

%%timeit
seed_repeated = seed[data[:,0]]
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))
diam = np.zeros(len(seed))
np.maximum.at(diam, data[:,0], dist_to_center)

1 loops, best of 3: 1.33 s per loop

Divakar's answer:

%%timeit
data_sorted = data[data[:, 0].argsort()]
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)
diam_out = np.maximum.reduceat(dists,shift_idx)

1 loops, best of 3: 1.65 s per loop

As we can see, the vectorized solutions buy us nothing here except a larger memory footprint. To avoid that, we should go back to the original approach, which is really the right way to do this, and instead try to reduce the amount of indexing:

%%timeit
idx = data[:, 0].argsort()              # sample indices, ordered by cluster
bins = np.bincount(data[:, 0])          # number of samples per cluster
counter = 0
for i in range(200):
    data_slice = idx[counter: counter + bins[i]]   # indices of cluster i's samples
    diam[i] = spd.cdist(seed[None, i, 1:], data[data_slice, 1:]).max()
    counter += bins[i]

1 loops, best of 3: 281 ms per loop

Double check the answer:

np.allclose(diam, diam_out)
True

This is the problem with assuming Python loops are always bad. They often are, but not in every situation.


Here is one vectorized approach -

# Sort data w.r.t. col-0
data_sorted = data[data[:, 0].argsort()]

# Get counts of unique tags in col-0 of data and repeat seed accordingly. 
# Thus, we would have an extended version of seed that matches data shape.
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)

# Get euclidean distances between extended seed version and sorted data
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))

# Get positions of shifts in col-0 of sorted data
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)

# Final piece of puzzle is to get tag based maximum values from dists, 
# where each tag is unique number in col-0 of data
diam_out = np.maximum.reduceat(dists,shift_idx)
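
To make the shift_idx / reduceat step concrete, here is a tiny toy example (made-up numbers, not from the benchmark):

labels = np.array([0, 0, 0, 1, 1, 2])          # sorted col-0 of data
dists  = np.array([3., 1., 4., 1., 5., 9.])    # per-sample distances

shift_idx = np.append(0, np.nonzero(np.diff(labels))[0] + 1)  # -> [0, 3, 5]
np.maximum.reduceat(dists, shift_idx)          # -> [4., 5., 9.]

Each entry of shift_idx marks where a new cluster label starts in the sorted array, and reduceat takes the maximum over each such run.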

Run-time checks and output checks -

Define functions:

def loopy_cdist(seed,data):  # from OP solution
    N = seed.shape[0]
    diam = np.empty(N)
    for i in range(N):
        diam[i]=spd.cdist(seed[np.newaxis,i,1:], data[data[:,0]==i][:,1:]).max()
    return diam

def vectorized_repeat_reduceat(seed,data):  # from this solution
    data_sorted = data[data[:, 0].argsort()]
    seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)
    dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))
    shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)
    return np.maximum.reduceat(dists,shift_idx)

def vectorized_indexing_maxat(seed,data): # from @moarningsun solution
    seed_repeated = seed[data[:,0]]
    dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))
    diam = np.zeros(len(seed))
    np.maximum.at(diam, data[:,0], dist_to_center)
    return diam

Verify the outputs:

In [417]: # Inputs
     ...: seed = np.random.rand(20,20)
     ...: data = np.random.randint(0,20,(40000,20))
     ...: 

In [418]: np.allclose(loopy_cdist(seed,data),vectorized_repeat_reduceat(seed,data))
Out[418]: True

In [419]: np.allclose(loopy_cdist(seed,data),vectorized_indexing_maxat(seed,data))
Out[419]: True

Runtimes:

In [420]: %timeit loopy_cdist(seed,data)
10 loops, best of 3: 35.9 ms per loop

In [421]: %timeit vectorized_repeat_reduceat(seed,data)
10 loops, best of 3: 28.9 ms per loop

In [422]: %timeit vectorized_indexing_maxat(seed,data)
10 loops, best of 3: 24.1 ms per loop

A similar idea to @Divakar's answer, but without the sorting:

seed_repeated = seed[data[:,0]]
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))

diam = np.zeros(len(seed))
np.maximum.at(diam, data[:,0], dist_to_center)

Note that ufunc.at is known to be rather slow, though, because it performs an unbuffered in-place operation.
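
A quick toy illustration (made-up numbers) of why ufunc.at is needed here at all: with repeated indices, buffered fancy-indexed assignment keeps only the last write per index, while maximum.at folds in every value:

idx = np.array([0, 0, 1, 2, 2])
d   = np.array([3., 4., 1., 5., 2.])

wrong = np.zeros(3)
wrong[idx] = np.maximum(wrong[idx], d)   # -> [4., 1., 2.]: duplicate indices lost
diam = np.zeros(3)
np.maximum.at(diam, idx, d)              # -> [4., 1., 5.]: correct per-index max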


Source: https://habr.com/ru/post/1614788/
