Pure-Python RNN and Theano RNN computing different gradients - code and results

I've been banging my head against this for a while and can't figure out what (if anything) I did wrong when implementing these RNNs. To save you some work on the forward pass: I can tell you that the two implementations compute the same outputs, so the forward pass is correct. The problem is in the backward pass.

Here is my Python backward-pass code. It follows the style of Karpathy's neuraltalk fairly closely, but not exactly:

def backward(self, cache, target, c=leastsquares_cost, dc=leastsquares_dcost):
    '''
    cache is from forward pass
    c is a cost function
    dc is a function used as dc(output, target) which gives the gradient dc/doutput
    '''
    XdotW = cache['XdotW']  # num_time_steps x hidden_size
    Hin = cache['Hin']      # num_time_steps x hidden_size
    T = Hin.shape[0]
    Hout = cache['Hout']
    Xin = cache['Xin']
    Xout = cache['Xout']
    Oin = cache['Oin']      # num_time_steps x output_size
    Oout = cache['Oout']

    dcdOin = dc(Oout, target)  # this will be num_time_steps x num_outputs. these are dc/dO_j

    dcdWho = np.dot(Hout.transpose(), dcdOin)  # this is the sum of outer products for all time
    # bias term is added at the end with coefficient 1, hence the dot product is just the sum
    dcdbho = np.sum(dcdOin, axis=0, keepdims=True)  # this sums over all the time steps

    # dcdHout_ij should be the dot product of dcdOin and the i'th row of Who; this is only for the outputs
    dcdHout = np.dot(dcdOin, self.Who.transpose())

    # now go back in time
    dcdHin = np.zeros(dcdHout.shape)
    # for t=T we can ignore the other term (error from the next timestep).
    # self.df is the derivative of the activation function (here, tanh):
    dcdHin[T-1] = self.df(Hin[T-1]) * dcdHout[T-1]  # dcdHout is already correct for t=T because there is no next timestep

    for t in reversed(xrange(T-1)):
        # we need to add to dcdHout[t] the error from the next timestep
        dcdHout[t] += np.dot(dcdHin[t], self.Whh.transpose())
        # now we have the correct form for dcdHout[t]
        dcdHin[t] = self.df(Hin[t]) * dcdHout[t]

    # now we've gone through all t, and we can continue
    dcdWhh = np.zeros(self.Whh.shape)
    for t in range(T-1):  # skip T because dcdHin[T+1] doesn't exist
        dcdWhh += np.outer(Hout[t], dcdHin[t+1])

    # and we can do the bias as well
    dcdbhh = np.sum(dcdHin, axis=0, keepdims=True)

    # now we need to go back to the embeddings
    dcdWxh = np.dot(Xout.transpose(), dcdHin)

    return {'dcdOout': dcdOin, 'dcdWxh': dcdWxh, 'dcdWhh': dcdWhh,
            'dcdWho': dcdWho, 'dcdbhh': dcdbhh, 'dcdbho': dcdbho,
            'cost': c(Oout, target)}
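
For reference, a finite-difference check of this backward pass could look roughly like the sketch below. The helper is hypothetical (it is not part of my code above) and assumes the forward()/backward() interface shown there, in particular the 'cost' entry of the returned dict:

def numerical_dcdWhh(rnn, x, target, eps=1e-5):
    # Hypothetical helper: central-difference estimate of d(cost)/dWhh,
    # to be compared entry-wise against the 'dcdWhh' returned by backward().
    grad = np.zeros_like(rnn.Whh)
    for i in range(rnn.Whh.shape[0]):
        for j in range(rnn.Whh.shape[1]):
            orig = rnn.Whh[i, j]
            rnn.Whh[i, j] = orig + eps
            cost_plus = rnn.backward(rnn.forward(x), target)['cost']
            rnn.Whh[i, j] = orig - eps
            cost_minus = rnn.backward(rnn.forward(x), target)['cost']
            rnn.Whh[i, j] = orig  # restore the original weight
            grad[i, j] = (cost_plus - cost_minus) / (2 * eps)
    return grad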

And here is the Theano code (basically copied from another implementation I found on the Internet; I initialize the Theano weights from my pure-Python RNN's randomly initialized weights so that everything is the same):

# input (where first dimension is time)
u = TT.matrix()
# target (where first dimension is time)
t = TT.matrix()
# initial hidden state of the RNN
h0 = TT.vector()
# learning rate
lr = TT.scalar()
# recurrent weights as a shared variable
W = theano.shared(rnn.Whh)
# input to hidden layer weights
W_in = theano.shared(rnn.Wxh)
# hidden to output layer weights
W_out = theano.shared(rnn.Who)
# bias 1
b_h = theano.shared(rnn.bhh[0])
# bias 2
b_o = theano.shared(rnn.bho[0])

# recurrent function (using tanh activation function) and linear output
# activation function
def step(u_t, h_tm1, W, W_in, W_out):
    h_t = TT.tanh(TT.dot(u_t, W_in) + TT.dot(h_tm1, W) + b_h)
    y_t = TT.dot(h_t, W_out) + b_o
    return h_t, y_t

# the hidden state `h` for the entire sequence, and the output for the
# entire sequence `y` (first dimension is always time)
[h, y], _ = theano.scan(step,
                        sequences=u,
                        outputs_info=[h0, None],
                        non_sequences=[W, W_in, W_out])
# error between output and target
error = (.5 * (y - t) ** 2).sum()
# gradients on the weights using BPTT
gW, gW_in, gW_out, gb_h, gb_o = TT.grad(error, [W, W_in, W_out, b_h, b_o])
# training function, that computes the error and updates the weights using SGD

Now here is the crazy thing. If I run the following:

fn = theano.function([h0, u, t, lr],
                     [error, y, h, gW, gW_in, gW_out, gb_h, gb_o],
                     updates={W: W - lr * gW,
                              W_in: W_in - lr * gW_in,
                              W_out: W_out - lr * gW_out})

er, yout, hout, gWhh, gWhx, gWho, gbh, gbo = fn(numpy.zeros((n,)), numpy.eye(5), numpy.eye(5), .01)

cache = rnn.forward(np.eye(5))
bc = rnn.backward(cache, np.eye(5))

print "sum difference between gWho (theano) and bc['dcdWho'] (pure python):"
print np.sum(gWho - bc['dcdWho'])
print "sum difference between gWhh (theano) and bc['dcdWhh'] (pure python):"
print np.sum(gWhh - bc['dcdWhh'])
print "sum difference between gWhx (theano) and bc['dcdWxh'] (pure python):"
print np.sum(gWhx - bc['dcdWxh'])
print "sum difference between the last row of gWhx (theano) and the last row of bc['dcdWxh'] (pure python):"
print np.sum(gWhx[-1] - bc['dcdWxh'][-1])

I get the following output:

sum difference between gWho (theano) and bc['dcdWho'] (pure python):
-4.59268040265e-16
sum difference between gWhh (theano) and bc['dcdWhh'] (pure python):
0.120527063611
sum difference between gWhx (theano) and bc['dcdWxh'] (pure python):
-0.332613468652
sum difference between the last row of gWhx (theano) and the last row of bc['dcdWxh'] (pure python):
4.33680868994e-18

So, I get the derivative of the hidden-to-output weight matrix right, but not the derivatives of the hidden → hidden or input → hidden weight matrices. The crazy thing is that I ALWAYS get the LAST ROW of the input → hidden weight-matrix gradient right. This makes no sense to me; I have no idea what is going on here. Note that the last row of the input → hidden weight matrix does NOT correspond to the last time step or anything like that (that could be explained, for example, by me computing the derivatives correctly for the last time step but not propagating back through time correctly). dcdWxh is a sum over all time steps, so how can I get one row of it right but none of the others?

Can anyone help? I'm all out of ideas here.

1 answer

You should compute the sum of the pointwise absolute values of the difference between the two matrices. The plain sum can be close to zero for reasons specific to the learning task (are you emulating the zero function? :), whatever it happens to be.
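
In code, with the variable names from your question, that check would be something like:

# absolute differences cannot cancel each other out the way signed ones can
print np.sum(np.abs(gWho - bc['dcdWho']))
print np.sum(np.abs(gWhh - bc['dcdWhh']))
print np.sum(np.abs(gWhx - bc['dcdWxh']))
print np.sum(np.abs(gWhx[-1] - bc['dcdWxh'][-1]))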

The last row apparently implements the weights from a constant input onto the neurons, i.e. the biases, so you would always get the biases right (but again, check the sum of the absolute values).
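
Since your code also keeps the biases as separate parameters (bhh, bho), a related sanity check, again just a sketch with your names, is to compare the bias gradients directly:

# if the bias handling is consistent, these should be ~0 as well
print np.sum(np.abs(gbh - bc['dcdbhh']))
print np.sum(np.abs(gbo - bc['dcdbho']))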

It also looks like row-major and column-major matrix notation is getting mixed up, as in

 gWhx - bc['dcdWxh'] 

which reads as the weight from "hidden to x" as opposed to "x to hidden".
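
A quick way to test for that kind of mix-up, sketched with your variable names, is to also compare against the transposed gradient; the hidden-to-hidden matrix is square, so the shapes always line up:

# small only if the two implementations use transposed layouts of Whh
print np.sum(np.abs(gWhh - bc['dcdWhh'].T))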

I would rather have posted this as a comment, but I don't have enough reputation. Sorry!

