Well, it turns out that Theano does not take previously computed gradients as input when it computes the gradients of lower layers in a computational graph, so calling T.grad separately for each parameter repeats the shared backpropagation work. Below is a dummy example of a neural network with three hidden layers and an output layer. In practice this overhead is not a big deal, because building the gradients is a once-per-model operation: you do not have to re-derive them at every training iteration. Theano returns the derivatives as a symbolic expression, itself a computational graph, and from that point on you simply treat it as a function: compile it once, use it to compute numerical gradient values, and update the weights with them (a minimal sketch of this compile-once training step follows the benchmark below).
import time

import numpy as np
import theano
import theano.tensor as T
from theano import shared


class neuralNet(object):
    def __init__(self, examples):
        # Layer sizes are hard-coded: 16384 -> 5000 -> 3000 -> 512 -> 40 classes.
        self.w = shared(np.random.random((16384, 5000)).astype(theano.config.floatX), borrow=True, name='w')
        self.w2 = shared(np.random.random((5000, 3000)).astype(theano.config.floatX), borrow=True, name='w2')
        self.w3 = shared(np.random.random((3000, 512)).astype(theano.config.floatX), borrow=True, name='w3')
        self.w4 = shared(np.random.random((512, 40)).astype(theano.config.floatX), borrow=True, name='w4')
        self.b = shared(np.ones(5000, dtype=theano.config.floatX), borrow=True, name='b')
        self.b2 = shared(np.ones(3000, dtype=theano.config.floatX), borrow=True, name='b2')
        self.b3 = shared(np.ones(512, dtype=theano.config.floatX), borrow=True, name='b3')
        self.b4 = shared(np.ones(40, dtype=theano.config.floatX), borrow=True, name='b4')

        self.x = examples

        # Three sigmoid hidden layers followed by a softmax output layer.
        L1 = T.nnet.sigmoid(T.dot(self.x, self.w) + self.b)
        L2 = T.nnet.sigmoid(T.dot(L1, self.w2) + self.b2)
        L3 = T.nnet.sigmoid(T.dot(L2, self.w3) + self.b3)
        L4 = T.dot(L3, self.w4) + self.b4
        self.forwardProp = T.nnet.softmax(L4)
        self.predict = T.argmax(self.forwardProp, axis=1)

    def loss(self, y):
        # Negative log-likelihood of the correct classes.
        return -T.mean(T.log(self.forwardProp)[T.arange(y.shape[0]), y])


x = T.matrix('x')
y = T.ivector('y')

nnet = neuralNet(x)
loss = nnet.loss(y)

# Efficient method: request all gradients in a single T.grad call.
differentiationTime = []
for i in range(100):
    t1 = time.time()
    gw, gw2, gw3, gw4, gb, gb2, gb3, gb4 = T.grad(
        loss, [nnet.w, nnet.w2, nnet.w3, nnet.w4, nnet.b, nnet.b2, nnet.b3, nnet.b4])
    differentiationTime.append(time.time() - t1)
print('Efficient Method: Took %f seconds with std %f' % (np.mean(differentiationTime), np.std(differentiationTime)))

# Inefficient method: one T.grad call per parameter.
differentiationTime = []
for i in range(100):
    t1 = time.time()
    gw = T.grad(loss, [nnet.w])
    gw2 = T.grad(loss, [nnet.w2])
    gw3 = T.grad(loss, [nnet.w3])
    gw4 = T.grad(loss, [nnet.w4])
    gb = T.grad(loss, [nnet.b])
    gb2 = T.grad(loss, [nnet.b2])
    gb3 = T.grad(loss, [nnet.b3])
    gb4 = T.grad(loss, [nnet.b4])
    differentiationTime.append(time.time() - t1)
print('Inefficient Method: Took %f seconds with std %f' % (np.mean(differentiationTime), np.std(differentiationTime)))
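For completeness, here is a minimal sketch of the compile-once workflow described above, continuing from the snippet (it reuses nnet, loss, x and y). The symbolic gradients and the update rule are compiled a single time with theano.function, and every training iteration afterwards is just a call into the compiled function. The learning rate, batch size, and random toy data are assumptions made purely for illustration, not part of the original benchmark.

import numpy as np
import theano
import theano.tensor as T

params = [nnet.w, nnet.w2, nnet.w3, nnet.w4, nnet.b, nnet.b2, nnet.b3, nnet.b4]
grads = T.grad(loss, params)  # symbolic differentiation, done once

# Plain SGD updates; the learning rate is an arbitrary illustrative choice.
learning_rate = np.asarray(0.01, dtype=theano.config.floatX)
updates = [(p, p - learning_rate * g) for p, g in zip(params, grads)]

# Compilation also happens only once.
train_step = theano.function(inputs=[x, y], outputs=loss, updates=updates)

# Each iteration only evaluates the compiled function; nothing is re-derived.
X_batch = np.random.random((32, 16384)).astype(theano.config.floatX)
y_batch = np.random.randint(0, 40, size=32).astype('int32')
for i in range(10):
    batch_loss = train_step(X_batch, y_batch)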
The output is:
Efficient Method: Took 0.061056 seconds with std 0.013217
Inefficient Method: Took 0.305081 seconds with std 0.026024
This shows that Theano takes a dynamic-programming-style approach when all the gradients are requested in a single T.grad call: the intermediate backpropagation expressions are built once and shared across parameters, instead of being rebuilt for every separate call.
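If you want to see the sharing directly rather than through timings, one rough way is to count the apply nodes in the symbolic graphs produced by the two approaches; the separate calls should yield a noticeably larger combined graph, because the overlapping backpropagation subgraphs are rebuilt per call. This is only a sketch, assuming Theano's graph utilities theano.gof.graph.inputs and io_toposort behave as in recent releases, and it continues from the benchmark snippet above.

import theano.tensor as T
from theano.gof import graph

params = [nnet.w, nnet.w2, nnet.w3, nnet.w4, nnet.b, nnet.b2, nnet.b3, nnet.b4]

# One call: intermediate gradient expressions are built once and shared.
joint = T.grad(loss, params)
joint_nodes = graph.io_toposort(graph.inputs(joint), joint)

# Separate calls: overlapping backpropagation subgraphs are rebuilt each time.
separate = [T.grad(loss, [p])[0] for p in params]
separate_nodes = graph.io_toposort(graph.inputs(separate), separate)

print('apply nodes, single T.grad call: %d' % len(joint_nodes))
print('apply nodes, separate T.grad calls: %d' % len(separate_nodes))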