Iterate over all pairwise combinations of numpy array columns

Question

Iterate over all pairwise combinations of numpy array columns

I have an array with dimensions

arr.size = (200, 600, 20).

I want to compute scipy.stats.kendalltau for each pairwise combination of the last two dimensions. For instance:

 kendalltau(arr[:, 0, 0], arr[:, 1, 0]) kendalltau(arr[:, 0, 0], arr[:, 1, 1]) kendalltau(arr[:, 0, 0], arr[:, 1, 2]) ... kendalltau(arr[:, 0, 0], arr[:, 2, 0]) kendalltau(arr[:, 0, 0], arr[:, 2, 1]) kendalltau(arr[:, 0, 0], arr[:, 2, 2]) ... ... kendalltau(arr[:, 598, 20], arr[:, 599, 20])

so I cover all combinations of arr[:, i, xi] with arr[:, j, xj] with i < j and xi in [0,20) , xj in [0, 20) . This is (600 choose 2) * 400 separate calculations, but since each takes about 0.002 s on my machine, it should not take much longer than one day with the multiprocessing module.

What is the best way to iterate over these columns (using i<j )? I suggest that I should avoid something like

 for i in range(600): for j in range(i+1, 600): for xi in range(20): for xj in range(20):

What is the simplest way to do this?

Edit: I changed the name since Kendall Tau is not very important for the question. I understand that I could do something like

 import itertools as it for i, j in it.combinations(xrange(600), 2): for xi, xj in product(xrange(20), xrange(20)):

but there should be a better, more vector way with numpy.

+6

python arrays numpy statistics

wflynny Aug 9 '13 at 20:20

source share

2 answers

If you do not want to use recursion, you usually should use itertools.combinations. There is no specific reason (afaik) why this should make your code run slower. In computationally intensive parts, numpy is still being processed. Itertools also has the advantage of readability.

0

Mike vella Aug 9 '13 at 20:39

source share

Jaime · Accepted Answer · 2013-08-09T21:18:08+0000

A common way to vectorize something like this is to use translation to create a Cartesian recruitment product with yourself. In your case, you have an array of arr forms (200, 600, 20) , so you would take two kinds of it:

 arr_x = arr[:, :, np.newaxis, np.newaxis, :] # shape (200, 600, 1, 1, 20) arr_y = arr[np.newaxis, np.newaxis, :, :, :] # shape (1, 1, 200, 600, 20)

The above two lines have been expanded for clarity, but I would usually write the equivalent:

 arr_x = arr[:, :, None, None] arr_y = arr

If you have a vectorized function, f , which was broadcast on all but the last dimension, you could do:

 out = f(arr[:, :, None, None], arr)

And then out will be an array of the form (200, 600, 200, 600) , and out[i, j, k, l] will contain the value f(arr[i, j], arr[k, l]) . For example, if you want to calculate all pairwise internal products, you can do:

 from numpy.core.umath_tests import inner1d out = inner1d(arr[:, :, None, None], arr)

Unfortunately, scipy.stats.kendalltau not vectorized like that. According to the documents

"If arrays are not 1-D, they will be flattened to 1-D."

So, you cannot do this, and you will complete the work of nested Python loops, whether explicitly writing them out using itertools or redefining it under np.vectorize . This will be slow due to iteration in Python variables and because you have a Python function on the iteration step, which are expensive steps.

Note that when you can go to a vectorized path, there is an obvious drawback: if your function is commutative, i.e. if f(a, b) == f(b, a) , then you do twice as many necessary calculations. Depending on how expensive your actual calculation is, this is very often offset by an increase in speed from the absence of any Python loops or function calls.

Iterate over all pairwise combinations of numpy array columns

More articles: