Calculate a histogram of the distances between points in a large dataset

Question

I have a large dataset representing 1.2 M points in a 220-dimensional periodic space (each coordinate ranges over (-pi, pi)), i.e. a 1.2M x 220 matrix.

I would like to calculate a histogram of the distances between these points, taking the periodicity into account. I wrote the code in Python, but it is still quite slow even for my test case (I haven't even tried to run it on the whole set).

Can you take a look and help me optimize it?

Any suggestions or comments would be highly appreciated.

import time

import numpy as np

# 1000 x 220 test set, uniform in (-pi, pi)
d = np.random.random((1000, 220)) * 2 * np.pi - np.pi

# Theoretical upper limit of the histogram range: with periodic wrapping,
# two points can be at most pi apart in each dimension.
m_ = np.sqrt(np.sum(np.full(d.shape[1], np.pi) ** 2))
# The histogram range is from 0 to mm (rounded up so no distance falls outside)
mm = np.ceil(m_)
bins = int(round(mm / 0.01))  # the number of bins must be an integer
m = np.zeros(bins)

# Main calculation: pair every point with the point i+1 rows further down
start_time = time.time()

for i in range(d.shape[0] - 1):
    diff = np.absolute(d[:-(i + 1), :] - d[i + 1:, :])
    # Wrap around the period: equivalent to min(diff, 2*pi - diff)
    diff = np.pi - np.absolute(diff - np.pi)
    s = np.sqrt(np.einsum('ij,ij->i', diff, diff))
    m += np.histogram(s, range=(0, mm), bins=bins)[0]

print(time.time() - start_time)
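
One possible variation, sketched here and not taken from the original post, is to process the points in fixed-size blocks so that the temporary arrays stay the same size no matter how large the dataset is, and to bin the distances with np.bincount. The function name periodic_distance_histogram and the parameters bin_width and block are hypothetical, and the sketch assumes every coordinate already lies in [-pi, pi]. Whether it is actually faster than the loop above would need to be measured.

```python
import numpy as np


def periodic_distance_histogram(points, bin_width=0.01, block=128):
    """Histogram of pairwise distances with period 2*pi along every axis.

    Assumes all coordinates lie in [-pi, pi] (hypothetical helper).
    """
    n, dim = points.shape
    r_max = np.pi * np.sqrt(dim)                 # largest possible wrapped distance
    n_bins = int(np.ceil(r_max / bin_width))
    edges = np.arange(n_bins + 1) * bin_width    # uniform bin edges starting at 0
    counts = np.zeros(n_bins, dtype=np.int64)

    for i0 in range(0, n, block):
        a = points[i0:i0 + block]
        for j0 in range(i0, n, block):           # only blocks on or above the diagonal
            b = points[j0:j0 + block]
            diff = np.abs(a[:, None, :] - b[None, :, :])
            diff = np.minimum(diff, 2 * np.pi - diff)   # periodic wrap per axis
            dist = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))
            if i0 == j0:
                # Same block on both sides: keep each unordered pair once
                dist = dist[np.triu_indices(len(a), k=1)]
            # Uniform bins, so the bin index is just floor(dist / bin_width);
            # clip to guard against rounding at the upper edge.
            idx = np.minimum((dist.ravel() / bin_width).astype(np.int64),
                             n_bins - 1)
            counts += np.bincount(idx, minlength=n_bins)
    return counts, edges


if __name__ == "__main__":
    d = np.random.random((1000, 220)) * 2 * np.pi - np.pi
    counts, edges = periodic_distance_histogram(d)
    print(counts.sum())  # number of unordered pairs, n * (n - 1) / 2
```

With the default block of 128 and 220 dimensions, the largest temporary array holds 128 x 128 x 220 doubles (about 29 MB). For the full 1.2 M points there are roughly 7.2 x 10^11 pairs, so any exact pass is a long computation; running on a random subsample first is one option to consider.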

python numpy scipy distance

didymos Apr 10 '14 at 16:13

1 answer

tomer.z · Answer 1 · 2014-09-07T23:52:09+0000
