The fastest way to fetch a column set from large arrays in Julia

Question

The fastest way to fetch a column set from large arrays in Julia

I am interested in performing statistics on various random samples of subsets of a large data matrix, and now the bottleneck in my code is the actual subsample. This seems strange to me, since there are enough calculated distances O (N ^ 2) for data with a subsample. I do not change subsamples at all, I just look at them.

using Distributions, Distances

function test_subsetting(X; batch_size=500, nloops=100)
    nfeatures, nsamples = size(X)
    ref_samples = X[:,1:10]
    batch_inds = zeros(batch_size)
    batch = zeros(nfeatures,batch_size)
    d_matrix = zeros(batch_size,batch_size)
    for i = 1:nloops
        batch_inds = sort(sample(1:nsamples, batch_size, replace = false))
        batch = X[:,batch_inds]
        d_matrix = pairwise(SqEuclidean(), batch, ref_samples)
    end
end

When I test it on an object 5000 by the type of matrix 50,000:

X_test = randn(5000,50000);

I see that I spend about half of my time _unsafe_getindexin multidimensional.jl, and the other half do distance calculation.

Is there a more efficient way to solve this problem?

+4

arrays subset julia-lang

Rory Jul 20 '16 at 23:30

source share

2

"", "" , . " ", .

, ? , , ?

+2

Tasos Papastylianou 21 . '16 1:44

tholy · Accepted Answer · 2016-07-23T17:42:23+0000

julia 0.5:

julia> using Distances, Distributions

julia> X = randn(500,1000);

julia> S = sample(1:1000,500,replace=false);

julia> M = view(X, :, S);

julia> S2 = sample(1:1000,500,replace=false);

julia> R = view(X, :, S2);

julia> pairwise(SqEuclidean(), M, R)
500×500 Array{Float64,2}:
  994.67 ...
...

view julia 0.5 slice ( sub, ) julia 0.4. ArrayViews.view, - , .

, view slice, , , At_mul_B! julia 0.4. , .

The fastest way to fetch a column set from large arrays in Julia

More articles: