Python numpy: execute a function for each pair of columns in a two-dimensional numpy array?

Question

I am trying to apply a function to each pair of columns in a numpy array (each column is a separate genotype).

For instance:

In [48]: g[0:10,0:10]

array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1, -1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
       [-1, -1,  0, -1, -1, -1, -1, -1, -1,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1]], dtype=int8)

My goal is to create a distance matrix d in which each element d[i,j] is the pairwise distance between columns i and j of g. For example:

d[0,1] = func(g[:,0], g[:,1])

Any ideas would be fantastic! Thanks!

function python numpy

ksw Apr 13 '18 at 16:25

3 answers

Kasramvd · Answer 1 · 2018-04-13T16:48:08+0000

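You can build pairs of adjacent columns with np.dstack and then apply the function along the third axis with np.apply_along_axis. Note that this pairs each column only with its right-hand neighbour, element by element, not every column with every other column.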

new = np.dstack((arr[:,:-1], arr[:, 1:]))
np.apply_along_axis(np.sum, 2, new)

Example:

In [86]: arr = np.array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
    ...:        [ 1,  1,  1,  1,  1,  1, -1,  1,  1,  1],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
    ...:        [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
    ...:        [-1, -1,  0, -1, -1, -1, -1, -1, -1,  0],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1]], dtype=np.int8)

In [87]: new = np.dstack((arr[:,:-1], arr[:, 1:]))

In [88]: new
Out[88]: 
array([[[ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1, -1]],

    ...

In [89]: np.apply_along_axis(np.sum, 2, new)
Out[89]: 
array([[ 2,  2,  2,  2,  2,  2,  2,  2,  0],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2,  0,  0,  2,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  0],
       [-2, -1, -1, -2, -2, -2, -2, -2, -1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2]])
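
As a cross-check, here is a minimal sketch on a small made-up array (arr_small is hypothetical, not taken from the question). It illustrates that the stacked array pairs each column with its right-hand neighbour only, and that for a reduction such as np.sum the same values can be obtained by passing axis=2 directly or by adding the two slices.

```python
import numpy as np

# Hypothetical 2x3 array, used only for illustration
arr_small = np.array([[1, 1, -1],
                      [0, 1,  1]], dtype=np.int8)

# Pair column j with column j+1: result has shape (rows, cols - 1, 2)
new = np.dstack((arr_small[:, :-1], arr_small[:, 1:]))

# Three ways to sum each adjacent pair
via_apply = np.apply_along_axis(np.sum, 2, new)    # calls np.sum once per pair
via_axis = new.sum(axis=2)                         # single vectorised reduction
via_slices = arr_small[:, :-1] + arr_small[:, 1:]  # no stacking needed for a sum

print(new.shape)
print(via_apply)
```

For an arbitrary Python function only the np.apply_along_axis form applies; the other two are shortcuts that work because summation is already vectorised.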

xg.plt.py · Answer 2 · 2018-04-13T16:48:08+0000

You can simply define the function as:

import itertools
import numpy as np
def count_snp_diffs(x, y):
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)

Then call it using an array generated with itertools.combinations as the index, so that every pair of columns is covered:

combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
dist = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])

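Additionally, if the output must be stored in a matrix (not recommended for large g, because only the upper triangle is filled and the rest holds no useful information), the same trick works: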

d = np.zeros((g.shape[1],g.shape[1]))
combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
d[combinations[:,0],combinations[:,1]] = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])

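Now d[i,j] holds the distance between columns i and j (while d[j,i] stays zero). This approach relies on the fact that arrays can be indexed with lists or arrays containing repeated indices: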

a = np.arange(3)+4
a[[0,1,1,1,0,2,1,1]]
# Out
# [4, 5, 5, 5, 4, 6, 5, 5]
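
If a full symmetric matrix is wanted rather than just the upper triangle, one option is sketched below. It is not part of the original answer: it assumes np.triu_indices as a replacement for the itertools.combinations index array (both enumerate the pairs in the same order), and the small array g is made up for the example.

```python
import itertools
import numpy as np

def count_snp_diffs(x, y):
    # Count rows where x and y differ, skipping rows where either value is negative
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)

# Hypothetical 4x3 array, used only for illustration
g = np.array([[0,  1, 1],
              [1,  1, 0],
              [0, -1, 1],
              [1,  0, 0]], dtype=np.int8)

n_cols = g.shape[1]
# Row and column indices of the upper triangle, diagonal excluded
i, j = np.triu_indices(n_cols, k=1)

d = np.zeros((n_cols, n_cols), dtype=int)
d[i, j] = count_snp_diffs(g[:, i], g[:, j])  # fill the upper triangle
d_sym = d + d.T                              # mirror it into the lower triangle

# Same pair order as itertools.combinations
pairs = list(itertools.combinations(range(n_cols), 2))
print(d_sym)
```

Mirroring with d + d.T is only valid when the distance function is symmetric and the diagonal is zero, which holds for count_snp_diffs.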

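Step by step, this is what happens.

g[:,combinations[:,0]] selects the first column of every pair, producing a new array that is compared column by column with the array produced by g[:,combinations[:,1]]. The result is a boolean array, diff. If g had 3 columns, diff would look like this, with one column for each of the comparisons 0,1, 0,2 and 1,2: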

[[ True False False]
 [False  True False]
 [ True  True False]
 [False False False]
 [False  True False]
 [False False False]]

Finally, the True values in each column are counted:

np.count_nonzero(diff,axis=0)
# Out
# [2 3 0]

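Additionally, because the boolean type in Python inherits from the integer type (roughly, False==0 and True==1; see the answers to "Is False == 0 and True == 1 in Python an implementation detail or is it guaranteed by the language?"), np.count_nonzero adds 1 for each True position, which gives the same result as np.sum: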

np.sum(diff,axis=0) 
# Out
# [2 3 0]
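
The steps above can be reproduced end to end with a small array. The 6x3 array g below is made up; its values were chosen so that diff matches the boolean table shown earlier. Negative entries are the ones that count_snp_diffs skips.

```python
import itertools
import numpy as np

# Hypothetical 6x3 array chosen to reproduce the boolean table in the answer
g = np.array([[0,  1, -1],
              [0, -1,  1],
              [0,  1,  1],
              [1,  1,  1],
              [0, -1,  1],
              [0,  0,  0]], dtype=np.int8)

# All column pairs: (0, 1), (0, 2), (1, 2)
combinations = np.array(list(itertools.combinations(range(g.shape[1]), 2)))

x = g[:, combinations[:, 0]]  # first column of each pair:  0, 0, 1
y = g[:, combinations[:, 1]]  # second column of each pair: 1, 2, 2

# True where the two columns differ and neither value is negative
diff = (x != y) & (x >= 0) & (y >= 0)

print(diff)
print(np.count_nonzero(diff, axis=0))  # [2 3 0]
print(diff.sum(axis=0))                # [2 3 0]
```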

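With large arrays, handling all pairs at once can exhaust memory and raise a MemoryError. In that case, the computation can be done in chunks: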

combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
n = len(combinations)
dist = np.empty(n)
# B = np.zeros((g.shape[1],g.shape[1]))
chunk = 200
# range replaces the Python 2 xrange; the last chunk may be shorter than the rest
for i in range(0, n, chunk):
    c = combinations[i:i+chunk]
    dist[i:i+chunk] = count_snp_diffs(g[:,c[:,0]], g[:,c[:,1]])
    # B[c[:,0],c[:,1]] = dist[i:i+chunk]

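With g.shape=(300,N), the execution times measured with %%timeit on python 2.7, numpy 1.14.2 and allel 1.1.10 were:

N = 10
- numpy + matrix storage: 107 μs
- numpy + 1D storage: 101 μs
- allel: 247 μs
N = 100
- numpy + matrix storage: 15.7 ms
- numpy + 1D storage: 16 ms
- allel: 22.6 ms
N = 1000
- numpy + matrix storage: 1.54 s
- numpy + 1D storage: 1.53 s
- allel: 2.28 s

With these array sizes, numpy is a little faster than the allel module.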

ksw · Answer 3 · 2018-04-13T17:53:02+0000

Thanks for the suggestions! I was just told that this is possible with scikit-allel (https://scikit-allel.readthedocs.io/en/latest/), where you can define your own distance function that is then applied to every pairwise combination of columns in a two-dimensional numpy array:

dist = allel.pairwise_distance(g, metric=count_snp_diffs)
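
For readers without scikit-allel, a similar result can be sketched with SciPy. This is an assumption of the example, not something the thread uses: scipy.spatial.distance.pdist accepts a callable metric and compares rows, so the array is transposed to compare columns, and squareform expands the condensed vector into a symmetric matrix. The small array g is made up.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def count_snp_diffs(x, y):
    # Count rows where x and y differ, skipping rows where either value is negative
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)

# Hypothetical 4x3 array, used only for illustration
g = np.array([[0,  1, 1],
              [1,  1, 0],
              [0, -1, 1],
              [1,  0, 0]], dtype=np.int8)

# pdist works on rows, so transpose to compare the columns of g
dist = pdist(g.T, metric=count_snp_diffs)  # condensed form, one value per pair
d = squareform(dist)                       # full symmetric matrix

print(dist)
print(d)
```

A callable metric is invoked once per pair in Python, so this form is convenient rather than fast.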

Thank you for your help!

http://alimanfoo.imtqy.com/2016/06/10/scikit-allel-tour.html
