I've been banging my head against this for a while and can't figure out what I did wrong (if anything) in implementing these RNNs. To spare you the forward pass, I can tell you that the two implementations compute the same outputs, so the forward pass is correct. The problem is the backward pass.
Here is my Python backward-pass code. It follows the style of Karpathy's neuraltalk fairly closely, but not exactly:
    def backward(self, cache, target, c=leastsquares_cost, dc=leastsquares_dcost):
        '''
        cache is from the forward pass
        c is a cost function
        dc is a function used as dc(output, target) which gives the gradient dc/doutput
        '''
        XdotW = cache['XdotW']
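(For readers who don't want to wade through the full function: the core of what backward does is a standard BPTT loop. Below is a stripped-down sketch of that kind of loop, assuming tanh hidden units, a least-squares cost, and my dcdWxh/dcdWhh/dcdWho naming; the function signature and array layout here are hypothetical, not my literal code:)

    import numpy as np

    def bptt_sketch(X, H, Y, target, Whh, Who):
        # X: (T, n_in) inputs, H: (T, n_hid) tanh hidden states, Y: (T, n_out) outputs
        T, n_hid = H.shape
        dY = Y - target                       # dc/dY for a least-squares cost
        dcdWho = H.T.dot(dY)                  # sum_t outer(h_t, dY_t)
        dcdWxh = np.zeros((X.shape[1], n_hid))
        dcdWhh = np.zeros((n_hid, n_hid))
        dh_next = np.zeros(n_hid)             # gradient arriving from step t+1
        for t in reversed(range(T)):
            dh = dY[t].dot(Who.T) + dh_next   # into h_t from the output and from t+1
            da = (1.0 - H[t] ** 2) * dh       # back through tanh to the pre-activation
            dcdWxh += np.outer(X[t], da)      # accumulated over ALL timesteps
            h_prev = H[t - 1] if t > 0 else np.zeros(n_hid)
            dcdWhh += np.outer(h_prev, da)
            dh_next = da.dot(Whh.T)           # propagate back through time
        return {'dcdWxh': dcdWxh, 'dcdWhh': dcdWhh, 'dcdWho': dcdWho}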
And here is the Theano code (basically copied from another implementation I found on the Internet; I initialize its weights to my pure-Python RNN's randomized weights so that everything is the same):
    # input (where first dimension is time)
    u = TT.matrix()
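(The rest of the graph, continuing from the `u = TT.matrix()` line above, follows the standard Theano scan-RNN recipe: shared weights, a step function scanned over time, a summed squared error, and TT.grad. Roughly like the following, where the sizes, tanh nonlinearity, and bias handling are my assumptions rather than the literal code:)

    import theano
    import theano.tensor as TT
    import numpy as np

    n, nin, nout = 5, 5, 5  # sizes assumed to match the eye(5) test below

    # target (also indexed by time), initial hidden state, learning rate
    t = TT.matrix()
    h0 = TT.vector()
    lr = TT.scalar()

    # shared weights/biases; in the real code these are set to the
    # pure-python RNN's randomized initial values
    W = theano.shared(np.random.uniform(-0.1, 0.1, (n, n)))
    W_in = theano.shared(np.random.uniform(-0.1, 0.1, (nin, n)))
    W_out = theano.shared(np.random.uniform(-0.1, 0.1, (n, nout)))
    b_h = theano.shared(np.zeros(n))
    b_o = theano.shared(np.zeros(nout))

    def step(u_t, h_tm1):
        # shared variables are closed over, so they need not be passed explicitly
        h_t = TT.tanh(TT.dot(u_t, W_in) + TT.dot(h_tm1, W) + b_h)
        y_t = TT.dot(h_t, W_out) + b_o
        return h_t, y_t

    [h, y], _ = theano.scan(step, sequences=u, outputs_info=[h0, None])
    error = ((y - t) ** 2).sum()
    gW, gW_in, gW_out, gb_h, gb_o = TT.grad(error, [W, W_in, W_out, b_h, b_o])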
Now here is the crazy thing. If I run the following:
    fn = theano.function([h0, u, t, lr],
                         [error, y, h, gW, gW_in, gW_out, gb_h, gb_o],
                         updates={W: W - lr * gW,
                                  W_in: W_in - lr * gW_in,
                                  W_out: W_out - lr * gW_out})

    er, yout, hout, gWhh, gWhx, gWho, gbh, gbo = fn(numpy.zeros((n,)), numpy.eye(5), numpy.eye(5), .01)
    cache = rnn.forward(np.eye(5))
    bc = rnn.backward(cache, np.eye(5))

    print "sum difference between gWho (theano) and bc['dcdWho'] (pure python):"
    print np.sum(gWho - bc['dcdWho'])
    print "sum difference between gWhh (theano) and bc['dcdWhh'] (pure python):"
    print np.sum(gWhh - bc['dcdWhh'])
    print "sum difference between gWhx (theano) and bc['dcdWxh'] (pure python):"
    print np.sum(gWhx - bc['dcdWxh'])
    print "sum difference between the last row of gWhx (theano) and the last row of bc['dcdWxh'] (pure python):"
    print np.sum(gWhx[-1] - bc['dcdWxh'][-1])
I get the following output:
    sum difference between gWho (theano) and bc['dcdWho'] (pure python):
    -4.59268040265e-16
    sum difference between gWhh (theano) and bc['dcdWhh'] (pure python):
    0.120527063611
    sum difference between gWhx (theano) and bc['dcdWxh'] (pure python):
    -0.332613468652
    sum difference between the last row of gWhx (theano) and the last row of bc['dcdWxh'] (pure python):
    4.33680868994e-18
So, I'm getting the derivatives of the weight matrix between the hidden layer and the output right, but not the derivatives of the hidden -> hidden or input -> hidden weight matrices. The crazy thing is that I ALWAYS get the LAST ROW of the input -> hidden weight matrix's gradient right. This makes no sense to me; I have no idea what is going on here. Please note that the last row of the input -> hidden weight matrix does NOT correspond to the last timestep or anything like that (that would be explained, for example, by me computing the derivatives correctly for the last timestep but failing to propagate back through time correctly). dcdWxh is the sum over all timesteps of the per-step contributions to dcdWxh, so how can I get one row of it correct but none of the others?
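To spell out what I mean by "the sum over all timesteps": the accumulation for dcdWxh is essentially the following, where delta[t] stands for the gradient reaching the hidden pre-activation at step t (the names here are just for illustration):

    # each entry dcdWxh[i, j] picks up a contribution X[t, i] * delta[t, j]
    # at every timestep t, so for a general input matrix X every row of
    # dcdWxh mixes information from all timesteps
    dcdWxh = sum(np.outer(X[t], delta[t]) for t in range(T))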
Can anyone help? I'm all out of ideas here.
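For what it's worth, a finite-difference check along these lines should arbitrate, entry by entry, which implementation is wrong (a sketch; cost_given_Wxh is a hypothetical helper that reruns the forward pass with a perturbed Wxh and returns the scalar cost):

    import numpy as np

    def numeric_grad(cost_given_Wxh, Wxh, eps=1e-5):
        # central differences, one entry of Wxh at a time
        g = np.zeros_like(Wxh)
        for i in range(Wxh.shape[0]):
            for j in range(Wxh.shape[1]):
                Wp, Wm = Wxh.copy(), Wxh.copy()
                Wp[i, j] += eps
                Wm[i, j] -= eps
                g[i, j] = (cost_given_Wxh(Wp) - cost_given_Wxh(Wm)) / (2 * eps)
        return g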