Theano hard_sigmoid() breaks gradient descent

To illustrate the problem, I followed this tutorial.


Theano has three ways to compute a sigmoid over a tensor, namely sigmoid, ultra_fast_sigmoid and hard_sigmoid. Using the last two seems to break the gradient descent algorithm.

The ordinary sigmoid converges as it should, but the other two show strange, inconsistent behaviour. ultra_fast_sigmoid throws an outright error when the gradient is computed, "MethodNotDefined: ('grad', UltraFastScalarSigmoid)", while hard_sigmoid compiles fine but never converges to the solution.
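The gradient failure can be reproduced in isolation; here is a minimal sketch of my own (not part of the tutorial), where the only difference between the two branches is which sigmoid variant is used:

    import theano
    import theano.tensor as T
    import theano.tensor.nnet as nnet

    v = T.dvector('v')

    # The exact sigmoid differentiates fine.
    g_ok = T.grad(T.sum(nnet.sigmoid(v)), v)

    # ultra_fast_sigmoid has no grad method, so this raises
    # MethodNotDefined already at graph-construction time.
    try:
        g_fail = T.grad(T.sum(nnet.ultra_fast_sigmoid(v)), v)
    except Exception as e:
        print('%s: %s' % (type(e).__name__, e))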


Does anyone know the source of this behaviour? Nothing in the documentation suggests this should happen, nor does it seem intuitive.


Code:

    import theano
    import theano.tensor as T
    import theano.tensor.nnet as nnet
    import numpy as np

    x = T.dvector()
    y = T.dscalar()

    def layer(x, w):
        b = np.array([1], dtype=theano.config.floatX)
        new_x = T.concatenate([x, b])
        m = T.dot(w.T, new_x)  # theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
        h = nnet.sigmoid(m)    # THIS SIGMOID RIGHT HERE
        return h

    def grad_desc(cost, theta):
        alpha = 0.1  # learning rate
        return theta - (alpha * T.grad(cost, wrt=theta))

    theta1 = theano.shared(np.array(np.random.rand(3,3), dtype=theano.config.floatX))
    theta2 = theano.shared(np.array(np.random.rand(4,1), dtype=theano.config.floatX))

    hid1 = layer(x, theta1)            # hidden layer
    out1 = T.sum(layer(hid1, theta2))  # output layer
    fc = (out1 - y)**2                 # cost expression

    cost = theano.function(inputs=[x, y], outputs=fc, updates=[
        (theta1, grad_desc(fc, theta1)),
        (theta2, grad_desc(fc, theta2))])
    run_forward = theano.function(inputs=[x], outputs=out1)

    inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2)  # training data X
    exp_y = np.array([1, 1, 0, 0])                             # training data Y

    cur_cost = 0
    for i in range(2000):
        for k in range(len(inputs)):
            cur_cost = cost(inputs[k], exp_y[k])  # call our Theano-compiled cost function, it will auto update weights
        if i % 500 == 0:  # only print the cost every 500 epochs/iterations (to save space)
            print('Cost: %s' % (cur_cost,))

    print(run_forward([0,1]))
    print(run_forward([1,1]))
    print(run_forward([1,0]))
    print(run_forward([0,0]))

To keep the output for this post short, I changed the following lines (they differ from the tutorial, but are already incorporated in the code above):

    from theano.tensor.nnet import binary_crossentropy as cross_entropy  # imports

    fc = cross_entropy(out1, y)  # cost expression

    for i in range(4000):  # training iteration

sigmoid

    Cost: 1.62724279493
    Cost: 0.545966632545
    Cost: 0.156764560912
    Cost: 0.0534911098234
    Cost: 0.0280394147992
    Cost: 0.0184933786794
    Cost: 0.0136444190935
    Cost: 0.0107482836159
    0.993652087577
    0.00848194143055
    0.990829396285
    0.00878482655791

ultra_fast_sigmoid

  File "test.py", line 30, in <module> (theta1, grad_desc(fc, theta1)), File "test.py", line 19, in grad_desc return theta - (alpha * T.grad(cost, wrt=theta)) File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 545, in grad grad_dict, wrt, cost_name) File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1283, in _populate_grad_dict rval = [access_grad_cache(elem) for elem in wrt] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache term = access_term_cache(node)[idx] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache output_grads = [access_grad_cache(var) for var in node.outputs] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache term = access_term_cache(node)[idx] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache output_grads = [access_grad_cache(var) for var in node.outputs] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache term = access_term_cache(node)[idx] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache output_grads = [access_grad_cache(var) for var in node.outputs] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache term = access_term_cache(node)[idx] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache output_grads = [access_grad_cache(var) for var in node.outputs] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache term = access_term_cache(node)[idx] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 951, in access_term_cache output_grads = [access_grad_cache(var) for var in node.outputs] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1241, in access_grad_cache term = access_term_cache(node)[idx] File "/usr/local/lib/python2.7/dist-packages/theano/gradient.py", line 1089, in access_term_cache input_grads = node.op.grad(inputs, new_output_grads) File "/usr/local/lib/python2.7/dist-packages/theano/tensor/elemwise.py", line 662, in grad rval = self._bgrad(inputs, ograds) File "/usr/local/lib/python2.7/dist-packages/theano/tensor/elemwise.py", line 737, in _bgrad scalar_igrads = self.scalar_op.grad(scalar_inputs, scalar_ograds) File "/usr/local/lib/python2.7/dist-packages/theano/scalar/basic.py", line 878, in grad self.__class__.__name__) theano.gof.utils.MethodNotDefined: ('grad', <class 'theano.tensor.nnet.sigm.UltraFastScalarSigmoid'>, 'UltraFastScalarSigmoid') 

hard_sigmoid

    Cost: 1.19810193303
    Cost: 0.684360309062
    Cost: 0.692614056124
    Cost: 0.697902474354
    Cost: 0.701540531661
    Cost: 0.703807604483
    Cost: 0.70470238116
    Cost: 0.704385738831
    0.4901260624
    0.486248177053
    0.489490785078
    0.493368670425
1 answer

Here is the source code of hard_sigmoid:

    def hard_sigmoid(x):
        """An approximation of sigmoid.
        More approximate and faster than ultra_fast_sigmoid.
        Approx in 3 parts: 0, scaled linear, 1
        Removing the slope and shift does not make it faster.
        """
        # Use the same dtype as determined by "upgrade_to_float",
        # and perform computation in that dtype.
        out_dtype = scalar.upgrade_to_float(scalar.Scalar(dtype=x.dtype))[0].dtype
        slope = tensor.constant(0.2, dtype=out_dtype)
        shift = tensor.constant(0.5, dtype=out_dtype)
        x = (x * slope) + shift
        x = tensor.clip(x, 0, 1)
        return x

So it is simply implemented as a piecewise linear function whose gradient is 0.2 inside the range (-2.5, 2.5) and 0 everywhere else. That means that whenever the input falls outside (-2.5, 2.5), its gradient is zero and no learning takes place.

Thus hard_sigmoid may not be practical for training, but it can be used to approximate the result at prediction time.
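One way to follow that advice, roughly sketched below using the layer/theta setup from the question (the act argument is my own addition, not something the tutorial requires): differentiate and train a graph built with nnet.sigmoid, and compile a separate, cheaper prediction function in which hard_sigmoid replaces the activation in the forward pass.

    import theano
    import theano.tensor as T
    import theano.tensor.nnet as nnet
    import numpy as np

    x = T.dvector()

    def layer(x, w, act=nnet.sigmoid):  # the activation is now switchable
        b = np.array([1], dtype=theano.config.floatX)
        new_x = T.concatenate([x, b])
        return act(T.dot(w.T, new_x))

    theta1 = theano.shared(np.array(np.random.rand(3, 3), dtype=theano.config.floatX))
    theta2 = theano.shared(np.array(np.random.rand(4, 1), dtype=theano.config.floatX))

    # Train on the exact sigmoid, so T.grad sees a smooth op with a nonzero gradient...
    out_train = T.sum(layer(layer(x, theta1), theta2))

    # ...and build a separate forward pass with hard_sigmoid for inference only.
    out_pred = T.sum(layer(layer(x, theta1, nnet.hard_sigmoid), theta2, nnet.hard_sigmoid))
    predict = theano.function([x], out_pred)

Because both graphs read the same shared variables theta1 and theta2, the prediction function automatically uses whatever weights the training updates have produced.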


Edit:
To compute the gradients of the network parameters, we usually use backpropagation.
Here is a very simple example.

    x = theano.tensor.scalar()
    w = theano.shared(numpy.float32(1))
    y = theano.tensor.nnet.hard_sigmoid(w*x)  # y = hard_sigmoid(w*x), w is initialized to 1.
    dw = theano.grad(y, w)                    # gradient wrt w, which is equal to slope*x in this case
    net = theano.function([x], [y, dw])

    print net(-3)
    print net(-1)
    print net(0)
    print net(1)
    print net(3)

    Output:
    [array(0.0), array(-0.0)]  # zero gradient because the slope is zero
    [array(0.3), array(-0.2)]
    [array(0.5), array(0.0)]   # zero gradient because x is zero
    [array(0.7), array(0.2)]
    [array(1.0), array(0.0)]   # zero gradient because the slope is zero

EDIT OP:
ultra_fast_sigmoid fails because, as its source code shows, the op is hand-coded for speed rather than built from tensor expressions, so Theano has no grad method defined for it.
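If the speed of ultra_fast_sigmoid is still wanted, a workaround I have seen suggested is to build and differentiate the graph with plain nnet.sigmoid and let the compiler substitute the fast op afterwards. This is only a sketch under the assumption that the local_ultra_fast_sigmoid graph optimization is available in the installed Theano version (check the flag name against your docs):

    import theano
    import theano.tensor as T
    import theano.tensor.nnet as nnet
    import numpy as np

    # Build the graph with the exact sigmoid so T.grad is well defined.
    v = T.dvector('v')
    w = theano.shared(np.float64(1.0))
    out = T.sum(nnet.sigmoid(w * v))
    gw = T.grad(out, w)

    # Ask the compiler to swap sigmoid for ultra_fast_sigmoid in the compiled
    # function (assumed optimization name; it can also be enabled globally
    # via the optimizer_including Theano flag).
    mode = theano.compile.get_default_mode().including('local_ultra_fast_sigmoid')
    f = theano.function([v], [out, gw], mode=mode)
    print(f(np.array([0.0, 1.0, -1.0])))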


Source: https://habr.com/ru/post/1242258/

