Looping (or vectorizing) over variable-length matrices in Theano

I have a list of matrices L, where each element M is an x*n matrix (x is variable, n is a constant).

I want to calculate the sum of M'*M over all elements of L (M' is the transpose of M), as the following Python code does:

res = np.zeros((n, n))
for M in L:
    res += np.dot(M.T, M)

Actually, I want to implement this in Theano (which does not support multidimensional arrays of variable length), and I do not want to pad all the matrices to a common size, because that would waste too much space (some of the matrices can be very large).
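For concreteness, here is a minimal, self-contained numpy version of the baseline (the list sizes are made up for illustration):

import numpy as np

n = 5
# Heights vary per matrix; the width n is shared.
L = [np.random.standard_normal((x, n)) for x in (3, 8, 2)]

res = np.zeros((n, n))
for M in L:
    res += np.dot(M.T, M)  # accumulate M' * M; res has shape (n, n)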

Is there a better way to do this?

Edit

L is known before the Theano compilation.

Edit

I received two excellent answers from @DanielRenshaw and @Divakar; it was genuinely hard to choose which one to accept.

+5
3 answers

You can simply stack the input arrays along the first axis, the one that holds all the x's. This gives a tall (X, n) array, where X = x1+x2+x3+.... This can be transposed, and its dot product with itself is the desired output of shape (n, n). All of this is achieved with a clean, vectorized solution leveraging a single powerful dot product. The implementation would be:

# Concatenate along axis=0
Lcat = np.concatenate(L, axis=0)

# Perform dot product of the transposed version with itself
out = Lcat.T.dot(Lcat)

Runtime test and output verification:

In [116]: def vectorized_approach(L):
     ...:     Lcat = np.concatenate(L, axis=0)
     ...:     return Lcat.T.dot(Lcat)
     ...:
     ...: def original_app(L):
     ...:     n = L[0].shape[1]
     ...:     res = np.zeros((n, n))
     ...:     for M in L:
     ...:         res += np.dot(M.T, M)
     ...:     return res
     ...:

In [117]: # Input
     ...: L = [np.random.rand(np.random.randint(1, 9), 5) for _ in range(1000)]

In [118]: np.allclose(vectorized_approach(L), original_app(L))
Out[118]: True

In [119]: %timeit original_app(L)
100 loops, best of 3: 3.84 ms per loop

In [120]: %timeit vectorized_approach(L)
1000 loops, best of 3: 632 µs per loop
+3

Given that the number of matrices is known before Theano compilation, you can simply use regular Python lists of Theano matrices.
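A minimal sketch of that idea for a hypothetical three-matrix case (the complete benchmark example follows):

import numpy as np
import theano
import theano.tensor as tt

# One symbolic matrix per list element; heights stay free,
# only the shared width is fixed by the data at call time.
L = [tt.matrix() for _ in range(3)]
res = sum(tt.dot(M.T, M) for M in L)  # one graph accumulating all products
f = theano.function(L, res)

A, B, C = (np.random.standard_normal((x, 5)) for x in (2, 7, 4))
print(f(A, B, C).shape)  # (5, 5)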

Here is a complete example showing the difference between the numpy and Theano versions.

This code has been updated to include a comparison with @Divakar's vectorized approach, which performs better. Two vectorized approaches are possible with Theano: one in which Theano performs the concatenation, and another in which numpy performs the concatenation and the result is then passed to Theano.

import timeit
import numpy as np
import theano
import theano.tensor as tt


def compile_theano_version1(number_of_matrices, n, dtype):
    assert number_of_matrices > 0
    assert n > 0
    L = [tt.matrix() for _ in xrange(number_of_matrices)]
    res = tt.zeros((n, n), dtype=dtype)
    for M in L:
        res += tt.dot(M.T, M)
    return theano.function(L, res)


def compile_theano_version2(number_of_matrices):
    assert number_of_matrices > 0
    L = [tt.matrix() for _ in xrange(number_of_matrices)]
    concatenated_L = tt.concatenate(L, axis=0)
    res = tt.dot(concatenated_L.T, concatenated_L)
    return theano.function(L, res)


def compile_theano_version3():
    concatenated_L = tt.matrix()
    res = tt.dot(concatenated_L.T, concatenated_L)
    return theano.function([concatenated_L], res)


def numpy_version1(*L):
    assert len(L) > 0
    n = L[0].shape[1]
    res = np.zeros((n, n), dtype=L[0].dtype)
    for M in L:
        res += np.dot(M.T, M)
    return res


def numpy_version2(*L):
    concatenated_L = np.concatenate(L, axis=0)
    return np.dot(concatenated_L.T, concatenated_L)


def main():
    iteration_count = 100
    number_of_matrices = 20
    n = 300
    min_x = 400
    dtype = 'float64'

    theano_version1 = compile_theano_version1(number_of_matrices, n, dtype)
    theano_version2 = compile_theano_version2(number_of_matrices)
    theano_version3 = compile_theano_version3()

    L = [np.random.standard_normal(size=(x, n)).astype(dtype)
         for x in range(min_x, number_of_matrices + min_x)]

    start = timeit.default_timer()
    numpy_res1 = np.sum(numpy_version1(*L)
                        for _ in xrange(iteration_count))
    print 'numpy_version1', timeit.default_timer() - start

    start = timeit.default_timer()
    numpy_res2 = np.sum(numpy_version2(*L)
                        for _ in xrange(iteration_count))
    print 'numpy_version2', timeit.default_timer() - start

    start = timeit.default_timer()
    theano_res1 = np.sum(theano_version1(*L)
                         for _ in xrange(iteration_count))
    print 'theano_version1', timeit.default_timer() - start

    start = timeit.default_timer()
    theano_res2 = np.sum(theano_version2(*L)
                         for _ in xrange(iteration_count))
    print 'theano_version2', timeit.default_timer() - start

    start = timeit.default_timer()
    theano_res3 = np.sum(theano_version3(np.concatenate(L, axis=0))
                         for _ in xrange(iteration_count))
    print 'theano_version3', timeit.default_timer() - start

    assert np.allclose(numpy_res1, numpy_res2)
    assert np.allclose(numpy_res2, theano_res1)
    assert np.allclose(theano_res1, theano_res2)
    assert np.allclose(theano_res2, theano_res3)


main()

When run, this prints something like:

numpy_version1 1.47830819649
numpy_version2 1.77405482179
theano_version1 1.3603150303
theano_version2 1.81665318145
theano_version3 1.86912039489

The assertions pass, showing that the Theano and numpy versions all compute the same result to a high degree of accuracy. Obviously, this accuracy will be reduced if float32 is used instead of float64.
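A rough standalone illustration of that precision gap (the sizes here are arbitrary, and not part of the benchmark above):

import numpy as np

L = [np.random.standard_normal((50, 30)) for _ in range(1000)]

res64 = sum(np.dot(M.T, M) for M in L)
res32 = sum(np.dot(M.astype('float32').T, M.astype('float32')) for M in L)

# The float32 accumulation drifts much further from the float64 result
# than float64 round-off would; the exact magnitude depends on the sizes.
print(np.max(np.abs(res32 - res64)))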

The timing results show that the vectorized approach is not always preferable; it depends on the matrix sizes. In the example above the matrices are large, so the approach without concatenation is faster, but if the parameters n and min_x in main were made much smaller, the concatenation approach would be quicker (a hypothetical configuration of that kind is sketched below). Different results may hold when running on a GPU (Theano versions only).
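As a rough, unmeasured illustration of that small-matrix regime, parameters of this scale in main should tend to favor the concatenation versions, since fixed per-matrix overhead then dominates the actual dot-product work:

iteration_count = 100
number_of_matrices = 100  # many more, much smaller matrices
n = 10
min_x = 2
dtype = 'float64'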

+5

Building on @DanielRenshaw's answer: if you increase the number of matrices to 1000, the compile_theano_version1 function raises RuntimeError: maximum recursion depth exceeded, and compile_theano_version2 seems to compile forever.

Here is a fix using typed_list:

def compile_theano_version4(number_of_matrices, n):
    import theano.typed_list
    L = theano.typed_list.TypedListType(
        tt.TensorType(theano.config.floatX, broadcastable=(None, None)))()
    res, _ = theano.scan(
        fn=lambda i: tt.dot(L[i].T, L[i]),
        sequences=[theano.tensor.arange(number_of_matrices, dtype='int64')])
    return theano.function([L], res.sum(axis=0))
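A hypothetical usage sketch, assuming the list passed in has exactly number_of_matrices elements, each a 2-D array of width n with dtype theano.config.floatX:

f = compile_theano_version4(number_of_matrices=3, n=5)
L = [np.random.standard_normal((x, 5)).astype(theano.config.floatX)
     for x in (2, 7, 4)]
print(f(L).shape)  # (5, 5)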

In addition, I set the data types of all relevant variables to float32 and ran @DanielRenshaw's script on the GPU; it turned out that @Divakar's suggestion (theano_version3) is the most efficient in this case. Although, as @DanielRenshaw said, using one huge matrix may not always be good practice.

Below are the settings and outputs on my machine.

iteration_count = 100
number_of_matrices = 200
n = 300
min_x = 20
dtype = 'float32'
theano.config.floatX = dtype

numpy_version1 5.30542397499
numpy_version2 3.96656394005
theano_version1 5.26742005348
theano_version2 1.76983904839
theano_version3 1.03577589989
theano_version4 5.58366179466
+1

Source: https://habr.com/ru/post/1239425/

