TensorFlow NaN error?

I use TensorFlow and I modified the tutorial example to take my RGB images.

The algorithm works flawlessly out of the box on the new set of images, until suddenly (while still converging, usually at around 92% accuracy) it crashes with the error that ReluGrad received non-finite values. Debugging shows that nothing unusual happens with the numbers until, very suddenly and for no apparent reason, the error is thrown. Adding

 print "max W vales: %g %g %g %g"%(tf.reduce_max(tf.abs(W_conv1)).eval(), tf.reduce_max(tf.abs(W_conv2)).eval(), tf.reduce_max(tf.abs(W_fc1)).eval(), tf.reduce_max(tf.abs(W_fc2)).eval())
 print "max b vales: %g %g %g %g"%(tf.reduce_max(tf.abs(b_conv1)).eval(), tf.reduce_max(tf.abs(b_conv2)).eval(), tf.reduce_max(tf.abs(b_fc1)).eval(), tf.reduce_max(tf.abs(b_fc2)).eval())

as debugging code in each training loop displays the following result:

 Step 8600 max W vales: 0.759422 0.295087 0.344725 0.583884 max b vales: 0.110509 0.111748 0.115327 0.124324
 Step 8601 max W vales: 0.75947 0.295084 0.344723 0.583893 max b vales: 0.110516 0.111753 0.115322 0.124332
 Step 8602 max W vales: 0.759521 0.295101 0.34472 0.5839 max b vales: 0.110521 0.111747 0.115312 0.124365
 Step 8603 max W vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38 max b vales: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38

Since none of my values is very large, the only way a NaN could arise is from a badly handled 0/0, but since this tutorial code performs no divisions or similar operations, I see no explanation other than that it comes from internal TF code.

I don’t know what to do about it. Any suggestions? The algorithm converges beautifully; its accuracy on my test set rose steadily and reached 92.5% at iteration 8600.

+58
nan tensorflow
Nov 14 '15 at 19:01
12 answers

Actually, it turned out to be something stupid. I am posting this in case someone else encounters a similar error.

 cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv)) 

is actually a terrible way to compute cross-entropy. On some samples, certain classes can be excluded with certainty after a while, resulting in y_conv = 0 for that sample. That is normally not a problem, since you are not interested in those classes, but with cross_entropy written as above it produces 0 * log(0) for that particular sample/class. Hence the NaN.

Replacing it with

 cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0))) 

solved all my problems.
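A scalar sketch in plain Python (ordinary IEEE-754 floats, no TF involved) shows where the NaN comes from and why clipping fixes it:

```python
import math

# log(0) is -inf, and 0 * -inf is NaN under IEEE-754 rules.
# tf.log behaves the same way at exactly 0, and a single NaN
# then poisons the entire reduce_sum.
neg_inf = float("-inf")        # stand-in for log(0)
product = 0.0 * neg_inf        # 0 * log(0)
print(product)                 # nan

# The clipping fix keeps the argument of log strictly positive:
clipped = max(1e-10, min(1.0, 0.0))   # scalar analogue of tf.clip_by_value
print(0.0 * math.log(clipped))        # -0.0 -- finite, harmless in the sum
```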

+128
Nov 14 '15 at 20:49

In fact, clipping is not a good idea, as it prevents the gradient from propagating back once the threshold is reached. Instead, we can add a small constant to the output of the softmax.

 cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10)) 
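To see the difference, here is a plain-Python sketch (hypothetical helper names) that approximates the gradient of each fix with a central finite difference, for a probability already below the clipping threshold:

```python
import math

def num_grad(f, p, h=1e-14):
    """Central finite difference -- purely illustrative."""
    return (f(p + h) - f(p - h)) / (2 * h)

clip = lambda p: max(1e-10, min(1.0, p))   # scalar stand-in for tf.clip_by_value
log_clipped = lambda p: math.log(clip(p))  # the clipping fix
log_eps = lambda p: math.log(p + 1e-10)    # the additive-constant fix

p = 1e-12  # already below the clipping threshold of 1e-10
print(num_grad(log_clipped, p))  # 0.0 -- clipping has flattened the function
print(num_grad(log_eps, p))      # ~9.9e9 -- large but finite; gradient still flows
```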
+26
Jul 30 '16 at 11:04

An alternative answer.

Many of the other solutions use clipping to avoid the undefined gradient. Depending on your problem, clipping introduces bias and may not be acceptable in all cases. As the following code demonstrates, we only need to handle the point of discontinuity, not the region near it.

Specific answer

 def cross_entropy(x, y, axis=-1):
     safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
     return -tf.reduce_sum(x * tf.log(safe_y), axis)

 def entropy(x, axis=-1):
     return cross_entropy(x, x, axis)

But did it work?

 x = tf.constant([0.1, 0.2, 0., 0.7])
 e = entropy(x)
 # ==> 0.80181855
 g = tf.gradients(e, x)[0]
 # ==> array([ 1.30258512,  0.60943794,  0.        , -0.64332503], dtype=float32)
 # Yay! No NaN.


General recipe

Use an inner tf.where to ensure that the function has no asymptote. That is, alter the input to the inf-generating function so that no inf can be produced. Then use an outer tf.where to always select the valid code path. That is, implement the mathematical condition as you "usually" would, i.e., as in the "naive" implementation.

In Python code, the recipe is:

Instead of this:

 tf.where(x_ok, f(x), safe_f(x)) 

Do this:

 safe_x = tf.where(x_ok, x, safe_value)  # safe_value: any input where f is finite
 tf.where(x_ok, f(safe_x), safe_f(x))

Example

Suppose you want to calculate:

 f(x) = { 1/x,  x != 0
        { 0,   x == 0

A naive implementation results in NaNs in the gradient, i.e.

 def f(x):
     x_ok = tf.not_equal(x, 0.)
     f = lambda x: 1. / x
     safe_f = tf.zeros_like
     return tf.where(x_ok, f(x), safe_f(x))

Does it work?

 x = tf.constant([-1., 0, 1])
 tf.gradients(f(x), x)[0].eval()
 # ==> array([ -1.,  nan,  -1.], dtype=float32)
 # ...bah! We have a NaN at the asymptote despite not having
 # an asymptote in the non-differentiated result.

The basic pattern for avoiding NaN gradients when using tf.where is to call tf.where twice. The inner tf.where ensures that the result of f(x) is always finite. The outer tf.where selects the correct result. For the running example, the trick plays out like this:

 def safe_f(x):
     x_ok = tf.not_equal(x, 0.)
     f = lambda x: 1. / x
     safe_f = tf.zeros_like
     safe_x = tf.where(x_ok, x, tf.ones_like(x))
     return tf.where(x_ok, f(safe_x), safe_f(x))

But did it work?

 x = tf.constant([-1., 0, 1])
 tf.gradients(safe_f(x), x)[0].eval()
 # ==> array([-1.,  0., -1.], dtype=float32)
 # ...yay! double-where trick worked. Notice that the gradient
 # is now a constant at the asymptote (as opposed to being NaN).
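The same double-where pattern can be sketched in NumPy, since np.where likewise evaluates both branches eagerly (this only illustrates the forward value; the gradient part is TF-specific):

```python
import numpy as np

def safe_reciprocal(x):
    """f(x) = 1/x where x != 0, else 0 -- using the double-where trick."""
    x_ok = x != 0.
    safe_x = np.where(x_ok, x, 1.)          # inner where: remove the asymptote
    return np.where(x_ok, 1. / safe_x, 0.)  # outer where: pick the right branch

x = np.array([-1., 0., 1.])
print(safe_reciprocal(x))  # [-1.  0.  1.] -- no warnings, no inf
```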
+19
Feb 27 '17 at 23:08

If y_conv is the result of softmax, say y_conv = tf.nn.softmax(x) , then an even better solution is to replace it with log_softmax :

 y = tf.nn.log_softmax(x)
 cross_entropy = -tf.reduce_sum(y_*y)
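A NumPy sketch of why this helps (using the standard max-shift trick; this mimics, but is not, TF's actual implementation):

```python
import numpy as np

logits = np.array([1000., 0., -1000.])

# Naive log(softmax(x)): exp overflows and the result degenerates.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = np.log(np.exp(logits) / np.sum(np.exp(logits)))
print(naive)   # [ nan -inf -inf]

# Stable log-softmax: x - max(x) - log(sum(exp(x - max(x))))
shifted = logits - np.max(logits)
stable = shifted - np.log(np.sum(np.exp(shifted)))
print(stable)  # [    0. -1000. -2000.]
```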
+13
Jul 20 '16 at 19:52

Sometimes you use tf.sqrt() without adding a small constant like 1e-10 inside it, which causes this nan problem.
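For intuition, a scalar sketch of the blow-up in plain Python (the derivative of sqrt(x) is 1/(2*sqrt(x)), which is infinite at x = 0):

```python
import math

# The sqrt gradient at exactly 0 is infinite; multiplied by a zero
# somewhere upstream, the chain rule turns it into NaN.
grad_at_zero = float("inf")     # 1 / (2*sqrt(0))
print(grad_at_zero * 0.0)       # nan

# Adding a small constant inside the sqrt keeps the gradient finite:
eps = 1e-10
grad_with_eps = 1. / (2. * math.sqrt(0.0 + eps))
print(grad_with_eps)            # 50000.0 -- large, but finite
```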

+2
Oct 27 '18 at 2:44

You are trying to calculate cross entropy using a standard formula. Not only is the value undefined at x=0 , it is also numerically unstable.

It is better to use tf.nn.softmax_cross_entropy_with_logits, or, if you really want to use a hand-crafted formula, to tf.clip_by_value the zeros to a very small number inside the log.
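As a rough NumPy sketch of what a logits-based loss buys you (this imitates the idea, not TF's actual source): work in log space directly from the logits, so no probability is ever materialized as exactly 0:

```python
import numpy as np

def softmax_xent_with_logits(labels, logits):
    """Illustrative stand-in for tf.nn.softmax_cross_entropy_with_logits."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
    return -np.sum(labels * log_softmax, axis=-1)

labels = np.array([[0., 0., 1.]])
logits = np.array([[1000., 0., 10.]])  # would overflow a naive softmax
print(softmax_xent_with_logits(labels, logits))  # [990.] -- finite
```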

+1
Apr 29 '17 at 5:32

Here is the implementation of binary (sigmoid) and categorical (softmax) cross-entropy losses in TensorFlow 1.1:

As you can see, in the binary case they handle several special cases to achieve numerical stability:

 # The logistic loss formula from above is
 #   x - x * z + log(1 + exp(-x))
 # For x < 0, a more numerically stable formula is
 #   -x * z + log(1 + exp(x))
 # Note that these two expressions can be combined into the following:
 #   max(x, 0) - x * z + log(1 + exp(-abs(x)))
 # To allow computing gradients at zero, we define custom versions of max and
 # abs functions.
 zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
 cond = (logits >= zeros)
 relu_logits = array_ops.where(cond, logits, zeros)
 neg_abs_logits = array_ops.where(cond, -logits, logits)
 return math_ops.add(relu_logits - logits * labels,
                     math_ops.log1p(math_ops.exp(neg_abs_logits)),
                     name=name)
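The combined formula can be checked in plain Python (hypothetical helper names; float64 here, but the overflow story is the same in float32, just at a much lower threshold):

```python
import math

def naive_logistic_loss(x, z):
    # x - x*z + log(1 + exp(-x)) -- overflows for very negative x
    return x - x * z + math.log(1 + math.exp(-x))

def stable_logistic_loss(x, z):
    # max(x, 0) - x*z + log1p(exp(-abs(x))), as in the TF source above
    return max(x, 0.) - x * z + math.log1p(math.exp(-abs(x)))

# Both agree where the naive form is healthy...
print(naive_logistic_loss(1., 0.), stable_logistic_loss(1., 0.))

# ...but only the stable form survives extreme logits:
print(stable_logistic_loss(-1000., 1.))  # 1000.0
try:
    naive_logistic_loss(-1000., 1.)
except OverflowError:
    print("naive formula overflowed")
```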
+1
May 16 '17 at 9:37 a.m.

I used an LSTM for long sequences and got nan gradients. None of these answers helped me. But I came up with three solutions of my own. I hope they are useful to other people who came here from a Google search.

  • Gradient clipping did not help me because the gradients turned to nan within a single batch update. In this case, you can replace the nans with zeros with lines like these:

     opt = tf.train.AdamOptimizer(args.lr)
     grads = opt.compute_gradients(loss)
     grads2 = [(tf.where(tf.is_nan(grad), tf.zeros(grad.shape), grad), var)
               for grad, var in grads]
     opt_op = opt.apply_gradients(grads2)

    If you want to track the appearance of nans, you can use this code:

     was_nan = tf.reduce_any(tf.convert_to_tensor([tf.reduce_any(tf.is_nan(g)) for g in grads])) 
  • Replace LSTMCell with LayerNormBasicLSTMCell - an LSTM cell with layer normalization - which is something similar to batch norm applied between timesteps.

  • If you use regular recurrent state dropout, you can replace it with "recurrent dropout without memory loss". The code:

     LayerNormBasicLSTMCell(neurons, dropout_keep_prob=0.8) 

    Note that you can also turn on the dropout alone, without layer normalization:

     LayerNormBasicLSTMCell(neurons, layer_norm=False, dropout_keep_prob=0.8) 
+1
Dec 06 '17 at 19:33

In addition to all the great answers above, I will add mine. This is a less common scenario, but it causes NaN: division by zero .

In my network, built for an NLP task, there is a layer that performs average pooling. Namely, each datum is a sequence of tokens. My layer embeds the tokens and then computes the average over the embedded vectors.

The average calculation is encoded as

 tf.reduce_sum(embedded)/tf.reduce_sum(tf.not_equal(input, pad)) 

Here pad is the dummy token I use in batch processing.

Now, if some data point contains an empty list of tokens (for whatever reason), its length (the denominator in the code fragment above) will be 0. This causes a division by zero, and the NaN then persists through all the layers / optimization steps that follow.

In case someone runs into this problem: I used tf.where to smooth the length:

 sum_embedding = tf.reduce_sum(embedded, 1)
 embedding_length = tf.reduce_sum(tf.cast(tf.not_equal(input, pad), dtype=tf.float32),
                                  axis=1, keep_dims=True)
 embedding_length_smoothed = tf.where(tf.greater(embedding_length, 0.0),
                                      embedding_length,
                                      tf.ones(tf.shape(embedding_length)))
 avg_embedding = sum_embedding / embedding_length_smoothed

Effectively, this treats every data point with a zero-length token list as if it had length 1, and avoids the NaN problem.

+1
Jul 02

I was getting nans sometimes, and other times not, while working on a standard feed-forward network. I had previously used similar TensorFlow code, and it worked fine.

It turns out I had accidentally imported the variable names. So, as soon as the first row (containing the variable names) was selected in a batch, the nan losses started. Maybe keep an eye out for that?

0
Feb 26 '18 at 18:39

I will add here one of the NaN problems I ran into previously. I used the sigmoid function as the activation of the last layer of my network. However, the sigmoid activation function uses the exponential function in its computation, and some really big numbers were going into the sigmoid.

This led to infinite gradients, and some NaNs began to appear.
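For illustration, a plain-Python sketch of the saturation (in float64 the overflow kicks in around |x| ≈ 710; in float32 frameworks it is roughly |x| ≈ 88, which is much easier to hit):

```python
import math

def naive_sigmoid(x):
    return 1. / (1. + math.exp(-x))   # exp(-x) overflows for large negative x

def stable_sigmoid(x):
    # Piecewise form that never exponentiates a large positive number.
    if x >= 0:
        return 1. / (1. + math.exp(-x))
    e = math.exp(x)
    return e / (1. + e)

print(stable_sigmoid(-1000.))   # 0.0 -- saturated, but no overflow
try:
    naive_sigmoid(-1000.)
except OverflowError:
    print("naive sigmoid overflowed")
```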

0
Jul 02 '19 at 8:27

I used the TensorFlow Estimator, which I believe accounts for this division by zero and other numerical stability problems, and I occasionally still get this error ( ERROR:tensorflow:Model diverged with loss = NaN during training ). Most of the time I get it because my input includes nans. So: make sure your input dataframes (or whatever you use) do not have NaN values hidden somewhere in them.
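A quick NumPy sketch of such a sanity check (hypothetical data; pandas users can use df.isna().any().any() the same way):

```python
import numpy as np

# Hypothetical input batch with a NaN hiding in one row.
batch = np.array([[0.1, 0.2],
                  [np.nan, 0.4]])

print(np.isnan(batch).any())            # True -- something bad is in there
bad_rows = np.isnan(batch).any(axis=1)
print(np.where(bad_rows)[0])            # [1] -- which rows are affected

clean = batch[~bad_rows]                # drop them before feeding the model
print(clean)                            # [[0.1 0.2]]
```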

0
Jul 12 '19 at 2:06


