I use tf.nn.sigmoid_cross_entropy_with_logits for the loss, and it goes to NaN.
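For context, this is roughly how I call it (the shapes and variable names here are placeholders for illustration, not my actual code); the important point is that I pass raw logits, not sigmoid outputs:

import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 8])   # raw pre-sigmoid outputs of the net (shape is illustrative)
targets = tf.placeholder(tf.float32, [None, 8])  # binary targets in [0, 1]

# element-wise cross-entropy, reduced to a scalar loss
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits))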
I already use gradient clipping; in the one place where tensor division is performed I add an epsilon to prevent division by zero, and I also add an epsilon to the arguments of all softmax functions.
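To illustrate what I mean (the epsilon value and the tensor names here are placeholders, not my exact code):

import tensorflow as tf

eps = 1e-8  # placeholder value; the exact epsilon I use may differ

numerator = tf.placeholder(tf.float32, [None])
denominator = tf.placeholder(tf.float32, [None])
similarity = tf.placeholder(tf.float32, [None, 10])

# division stabilized against a zero denominator
ratio = numerator / (denominator + eps)

# epsilon added to the softmax argument as well
weights = tf.nn.softmax(similarity + eps)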
However, I get NaN in the middle of training.
Are there any known issues where TensorFlow does this that I have missed? It is quite frustrating, because the loss randomly goes to NaN during training and ruins everything.
Also, is there a way to detect whether a train step is going to produce NaN and, if so, skip that batch altogether? Any suggestions?
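One idea I have considered (just a sketch, assuming a standard feed_dict training loop where sess, loss, train_step and feed are already defined; it costs an extra forward pass) is to evaluate the loss first and only apply the gradients if it is finite:

import numpy as np

loss_val = sess.run(loss, feed_dict=feed)        # forward pass only
if np.isfinite(loss_val):
    sess.run(train_step, feed_dict=feed)         # safe to update
else:
    print("skipping batch, non-finite loss:", loss_val)

This would not catch the case where the gradients blow up even though the loss is still finite, but it at least avoids applying an obviously broken update.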
EDIT: The network is a Neural Turing Machine (NTM).
EDIT 2: I have uploaded part of the code here. It is uncommented and will only make sense to those who have already read the NTM paper by Graves et al., available here: https://arxiv.org/abs/1410.5401
I'm not sure all of my code follows exactly what the paper's authors intended. I'm just doing this as practice, and I have no mentor to correct me.
EDIT 3: Here is the gradient clipping code:
optimizer = tf.train.AdamOptimizer(self.lr)
gvs = optimizer.compute_gradients(loss)
# clip each gradient element-wise to [-1, 1]; gradients that are None
# (variables the loss does not depend on) are passed through unchanged
capped_gvs = [(tf.clip_by_value(grad, -1.0, 1.0), var) if grad is not None else (grad, var)
              for grad, var in gvs]
train_step = optimizer.apply_gradients(capped_gvs)
I had to add the if grad is not None condition because compute_gradients returns None for variables that the loss does not depend on, and tf.clip_by_value raises an error on those. Could the problem be here?
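For debugging, one variant I am considering (built on the snippet above, not something I have verified fixes anything) wraps each gradient in tf.check_numerics, so the run fails with a message naming the variable whose gradient first becomes NaN or Inf:

# optimizer and gvs are the same as in the snippet above
checked_gvs = [
    (tf.clip_by_value(tf.check_numerics(grad, "NaN/Inf in gradient of " + var.name),
                      -1.0, 1.0), var)
    if grad is not None else (grad, var)
    for grad, var in gvs
]
train_step = optimizer.apply_gradients(checked_gvs)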
Potential solution: I have been using tf.contrib.losses.sigmoid_cross_entropy for a while now, and so far the loss has not diverged. I will run a few more training runs and report back.