Optimized calculation of pair correlations in Python

Given a set of discrete locations (for example, "sites") that are pairwise connected in some categorical way (for example, overall proximity) and that carry local-level data (for example, population size), I want to efficiently calculate the average correlation coefficient between the local-level data of paired locations that share the same relationship category.

For example, I assumed 100 sites and randomized their pairwise relationships using values from 1 to 25, getting a triangular matrix relations:

import numpy as np

sites = 100
categ = 25

relations = np.random.randint(low=1, high=categ+1, size=(sites, sites))
relations = np.triu(relations) # set relation_ij = relation_ji
np.fill_diagonal(relations, 0) # ignore self-relation
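
Because of the np.triu and fill_diagonal calls, each unordered site pair is stored exactly once, in the strict upper triangle. A quick sanity check of that layout (just a sketch; rr and cc are illustrative names):

rr, cc = np.nonzero(relations)            # indices of all stored relations
assert np.all(rr < cc)                    # only strictly upper-triangular entries remain
assert len(rr) == sites*(sites-1)//2      # exactly one entry per unordered site pair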

I also have 5,000 replicates of simulation results for each site:

sims = 5000
res = np.round(np.random.rand(sites, sims), 1) # rounded to one decimal, so exact zeros occur (treated as missing data below)

Then, for each relation category i, I compute the correlation coefficient rho[j] between the res rows of the two sites in every unique pair j of that category (ignoring simulations where either result is zero), and average the coefficients per category:

rho_list = np.ones(categ)*99 # average correlation per relation category (99 is just a placeholder)

for i in range(1, categ+1):
    idr = np.transpose(np.where(relations == i)) # pairwise site indices of the same relation category
    comp = np.vstack([res[idr[:,0]].ravel(), res[idr[:,1]].ravel()]) # pairwise comparisons of simulation results from the same relation category
    comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1)) # reshape above into pairwise comparisons of simulation results between unique site pairs

    rho = np.ones(len(idr))*99 # correlation coefficients of all unique site pairs of current relation category

    for j in range(len(idr)): # loop through unique site pairs
        comp_uniq_s = comp_uniq[j][np.all(comp_uniq!=0, axis=2)[j]].T # shorten comparisons by removing pairs with zero-valued result
        rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0,1]

    rho_list[i-1] = np.nanmean(rho)
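
One note on the np.nanmean call: if the zero-filtering leaves a pair with a constant (or nearly empty) series, the correlation is undefined and np.corrcoef returns NaN for that pair, so NaNs have to be skipped when averaging. A minimal illustration:

a = np.array([0.3, 0.3, 0.3])      # constant series -> zero standard deviation
b = np.array([0.1, 0.5, 0.9])
print(np.corrcoef(a, b)[0, 1])     # nan (with a RuntimeWarning), hence np.nanmean above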

The script works, but it is slow: with sites = 400 one run takes more than 6 hours. Is there a more efficient way to do this?

The inner loop over j can be vectorized by applying the zero-value masking to all site pairs of a category at once and computing the correlation coefficients directly from their definition instead of calling np.corrcoef pair by pair (following the masked-correlation approach from this post). The outer loop over the 25 relation categories is cheap, so it stays.
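
Concretely, for every pair j the code below in effect evaluates the Pearson correlation restricted to the simulations k that are non-zero for both sites of the pair:

\[
\rho_j = \frac{\sum_{k}(A_{jk}-\bar A_j)(B_{jk}-\bar B_j)}
{\sqrt{\sum_{k}(A_{jk}-\bar A_j)^2 \, \sum_{k}(B_{jk}-\bar B_j)^2}}
\]

where A_j and B_j are the res rows of the two sites in pair j, the sums run only over the unmasked simulations, and the means are taken over those same simulations. The three einsum calls evaluate the three sums for all pairs of a category at once.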

Here is the implementation (in the timings below, the question's loop is wrapped as loopy_app() and this one as vectorized_app()):

rho_list = np.ones(categ)*99          # same sentinel initialization as in the question

for i in range(1, categ+1):
    r, c = np.where(relations == i)   # site indices of every pair in relation category i

    A = res[r]                        # simulation results of the first site of each pair
    B = res[c]                        # simulation results of the second site of each pair

    # mask simulations where either site has a zero-valued result
    mask0 = ~((A != 0) & (B != 0))
    A[mask0] = 0
    B[mask0] = 0

    # number of valid simulations per pair, used for the per-pair means
    count = mask0.shape[-1] - mask0.sum(-1, keepdims=1)
    A_mA = A - A.sum(-1, keepdims=1)/count
    B_mB = B - B.sum(-1, keepdims=1)/count

    # re-zero the masked entries so they do not contribute to the sums below
    A_mA[mask0] = 0
    B_mB[mask0] = 0

    # row-wise sums of squares and cross products -> Pearson correlation per pair
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)
    rho = np.einsum('ij,ij->i', A_mA, B_mB)/np.sqrt(ssA*ssB)

    rho_list[i-1] = np.nanmean(rho)
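
As a quick consistency check (a sketch that reuses the arrays left over from the last loop iteration, i.e. category i == categ), the vectorized result for a single pair can be recomputed with a plain np.corrcoef on the zero-filtered data:

a, b = res[r[0]], res[c[0]]                 # first site pair of the last category
keep = (a != 0) & (b != 0)                  # same zero-filtering as above
print(np.corrcoef(a[keep], b[keep])[0, 1])  # should match the vectorized value
print(rho[0])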

№1: sites = 100

In [381]: %timeit loopy_app()
1 loop, best of 3: 7.45 s per loop

In [382]: %timeit vectorized_app()
1 loop, best of 3: 479 ms per loop

15x+ speedup.

№2: sites = 200

In [387]: %timeit loopy_app()
1 loop, best of 3: 1min 56s per loop

In [388]: %timeit vectorized_app()
1 loop, best of 3: 1.86 s per loop

In [390]: 116/1.86
Out[390]: 62.36559139784946

62x+ speedup (1 min 56 s ≈ 116 s versus 1.86 s).

№3: sites = 400

In [392]: %timeit vectorized_app()
1 loop, best of 3: 7.64 s per loop

That is 7.64 s against the 6+ hours reported for the OP's loopy approach.

The speedup keeps growing with the number of sites.
