Determinism in TensorFlow gradient updates?

I have a very simple NN script written in TensorFlow, and I am having a hard time tracking down where some kind of "randomness" is coming from.

While training, I recorded the

  • weights,
  • gradients,
  • logits

of my network, and for the first iteration it is clear that everything starts out the same. I have one SEED value for reading the data and another SEED value for initializing the network weights, and I never change them.
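
Roughly, the seeding looks something like this (the SEED value and the variable shape here are placeholders, not my actual ones):

    import numpy as np
    import tensorflow as tf

    SEED = 1234  # placeholder value

    np.random.seed(SEED)      # seed on the data-reading / shuffling side
    tf.set_random_seed(SEED)  # graph-level seed for the initializers

    # per-op seed on a weight initializer, so every restart begins
    # from identical weights
    w = tf.get_variable(
        "w", shape=[784, 10],
        initializer=tf.truncated_normal_initializer(stddev=0.1, seed=SEED))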

My problem is that from about the second iteration of each restart, I start to see the gradients diverge (by a small amount, for example 1e-6 or so). Over time, this of course leads to irreproducible behavior.

What could be the reason for this? I can't see where any possible source of randomness could be coming from...

thanks

+3
2 answers

There is a good chance that you can get deterministic results if you run your network on the CPU ( export CUDA_VISIBLE_DEVICES= ), with one thread in the native thread pool ( tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1)) ), one Python thread (i.e. without the multi-threaded input queues that you get from ops such as tf.batch), and one well-defined order of operations. In addition, setting inter_op_parallelism_threads=1 may help in some scenarios.
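
A minimal sketch of that configuration (TF1-style API; the training step itself is assumed to exist elsewhere):

    import os
    import tensorflow as tf

    # Hide all GPUs so every op runs on the CPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    # One thread inside each op, and one thread for scheduling independent ops,
    # so the order of floating point accumulations is fixed across runs.
    config = tf.ConfigProto(intra_op_parallelism_threads=1,
                            inter_op_parallelism_threads=1)

    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        # ... run the training steps here ...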

One problem is that floating point addition / multiplication is not associative, so one sure-fire way to get deterministic results is to use integer arithmetic or quantized values.
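
A quick way to see the non-associativity with plain Python floats:

    a, b, c = 1e-8, 1.0, -1.0

    print((a + b) + c)  # ~9.9999999e-09: rounding in (a + b) loses part of a
    print(a + (b + c))  # 1e-08 exactly: b + c cancels cleanly first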

Barring that, you can isolate which operation is non-deterministic and try to avoid using that op. For example, there is the tf.add_n op, which says nothing about the order in which it sums its values, and different orders can give different results.
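
A sketch of that idea (hypothetical input tensors; the explicit chain pins down the summation order, at the cost of a longer dependency chain and therefore less parallelism):

    import tensorflow as tf

    xs = [tf.random_normal([1000], seed=i) for i in range(4)]  # hypothetical inputs

    # tf.add_n makes no promise about the order in which the terms are reduced.
    total_fast = tf.add_n(xs)

    # An explicit left-to-right chain of tf.add fixes the order.
    total_fixed = xs[0]
    for x in xs[1:]:
        total_fixed = tf.add(total_fixed, x)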

Getting deterministic results is an uphill battle, because determinism conflicts with performance, and performance is usually the goal that gets more attention. An alternative to trying to get exactly the same numbers on every restart is to focus on numerical stability: if your algorithm is stable, you will get reproducible results (i.e. the same number of misclassifications), even though the exact parameter values may differ slightly.

+5

The TensorFlow reduce_sum op is known to be non-deterministic. In addition, reduce_sum is used to compute the bias gradients.

This post discusses a workaround to avoid using reduce_sum (i.e. taking the dot product of a vector with an all-ones vector is the same as reduce_sum over that vector).
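
A sketch of that workaround (the shapes and names here are made up for illustration):

    import tensorflow as tf

    batch_size = 128
    grads = tf.random_normal([batch_size, 10], seed=0)  # e.g. per-example bias gradients

    # The usual reduction over the batch dimension:
    bias_grad_sum = tf.reduce_sum(grads, axis=0)

    # The same reduction written as a matmul with an all-ones vector,
    # which goes through a different kernel:
    ones = tf.ones([1, batch_size])
    bias_grad_matmul = tf.reshape(tf.matmul(ones, grads), [-1])  # shape [10]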

0

Source: https://habr.com/ru/post/1264619/

