I am interested in computing statistics on various random subsamples of a large data matrix, and it turns out the bottleneck in my code is the subsampling itself. This seems strange to me, since there is plenty of O(N^2) distance computation being done on each subsample. I never modify the subsamples, I only read from them.
using Distributions, Distances, StatsBase   # `sample` is provided by StatsBase
function test_subsetting(X; batch_size=500, nloops=100)
    nfeatures, nsamples = size(X)
    ref_samples = X[:, 1:10]                 # fixed set of reference columns
    batch_inds = zeros(Int, batch_size)      # preallocated, but rebound in the loop below
    batch = zeros(nfeatures, batch_size)
    d_matrix = zeros(batch_size, size(ref_samples, 2))
    for i = 1:nloops
        # draw a random subset of columns, then compute distances to the references
        batch_inds = sort(sample(1:nsamples, batch_size, replace=false))
        batch = X[:, batch_inds]             # this copy is where the indexing time goes
        d_matrix = pairwise(SqEuclidean(), batch, ref_samples, dims=2)  # columns are observations
    end
end
When I test it on a 5000 × 50,000 matrix:
X_test = randn(5000,50000);
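The exact call used for the timing is not shown above; presumably it was something along these lines (a minimal sketch, with a warm-up call so that compilation time is not measured):

test_subsetting(X_test)         # warm-up: compile the function first
@time test_subsetting(X_test)   # rough wall-clock timing of the 100 loops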
I see that about half of my time is spent in _unsafe_getindex in multidimensional.jl, and the other half goes to the distance calculation.
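For context, the split between indexing and distance computation can be reproduced with Julia's built-in sampling profiler; this is just a sketch of how one might check it, not code from the original post:

using Profile
test_subsetting(X_test)                        # compile first so the profile reflects runtime only
Profile.clear()
@profile test_subsetting(X_test)
Profile.print(format=:flat, sortedby=:count)   # look for _unsafe_getindex vs. the pairwise call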
Is there a more efficient way to solve this problem?