Determinism in TensorFlow gradient updates?

I have a very simple NN script written in TensorFlow, and I am having a hard time tracking down where some kind of "randomness" is coming from.

While training, I recorded the

  • weights,
  • gradients,
  • logits

of my network, and for the first iteration it is clear that everything starts out the same. I have one SEED value for reading the data and another SEED value for initializing the network weights, and I never change them.
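
Roughly, the seeding looks something like this (the SEED value and the variable shape here are placeholders, not my actual ones):

    import numpy as np
    import tensorflow as tf

    SEED = 1234  # placeholder value

    np.random.seed(SEED)      # seed on the data-reading / shuffling side
    tf.set_random_seed(SEED)  # graph-level seed for the initializers

    # per-op seed on a weight initializer, so every restart begins
    # from identical weights
    w = tf.get_variable(
        "w", shape=[784, 10],
        initializer=tf.truncated_normal_initializer(stddev=0.1, seed=SEED))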

My problem is that from about the second iteration of each restart, I start to see the gradients diverge (by a small amount, for example 1e-6 or so). Over time, this of course leads to irreproducible behavior.

What could be the reason for this? I can't see where any possible source of randomness could be coming from...

thanks

+3
2 answers

There is a good chance that you can get deterministic results if you run your network on the CPU ( export CUDA_VISIBLE_DEVICES= ), with one thread in the native thread pool ( tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1)) ), one Python thread (i.e. without the multi-threaded input queues that you get from ops such as tf.batch), and one well-defined order of operations. In addition, setting inter_op_parallelism_threads=1 may help in some scenarios.
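
A minimal sketch of that configuration (TF1-style API; the training step itself is assumed to exist elsewhere):

    import os
    import tensorflow as tf

    # Hide all GPUs so every op runs on the CPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    # One thread inside each op, and one thread for scheduling independent ops,
    # so the order of floating point accumulations is fixed across runs.
    config = tf.ConfigProto(intra_op_parallelism_threads=1,
                            inter_op_parallelism_threads=1)

    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        # ... run the training steps here ...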

One problem is that floating point addition / multiplication is not associative, so one sure-fire way to get deterministic results is to use integer arithmetic or quantized values.
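
A quick way to see the non-associativity with plain Python floats:

    a, b, c = 1e-8, 1.0, -1.0

    print((a + b) + c)  # ~9.9999999e-09: rounding in (a + b) loses part of a
    print(a + (b + c))  # 1e-08 exactly: b + c cancels cleanly first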

Barring that, you can isolate which operation is non-deterministic and try to avoid using that op. For example, there is the tf.add_n op, which says nothing about the order in which it sums its values, and different orders can give different results.
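
A sketch of that idea (hypothetical input tensors; the explicit chain pins down the summation order, at the cost of a longer dependency chain and therefore less parallelism):

    import tensorflow as tf

    xs = [tf.random_normal([1000], seed=i) for i in range(4)]  # hypothetical inputs

    # tf.add_n makes no promise about the order in which the terms are reduced.
    total_fast = tf.add_n(xs)

    # An explicit left-to-right chain of tf.add fixes the order.
    total_fixed = xs[0]
    for x in xs[1:]:
        total_fixed = tf.add(total_fixed, x)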

Getting deterministic results is an uphill battle, because determinism conflicts with performance, and performance is usually the goal that gets more attention. An alternative to trying to get exactly the same numbers on every restart is to focus on numerical stability: if your algorithm is stable, you will get reproducible results (i.e. the same number of misclassifications), even though the exact parameter values may differ slightly.

+5

The TensorFlow reduce_sum op is known to be non-deterministic. In addition, reduce_sum is used to compute the bias gradients.

This post discusses a workaround to avoid using reduce_sum (i.e. taking the dot product of a vector with an all-ones vector is the same as reduce_sum over that vector).
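
A sketch of that workaround (the shapes and names here are made up for illustration):

    import tensorflow as tf

    batch_size = 128
    grads = tf.random_normal([batch_size, 10], seed=0)  # e.g. per-example bias gradients

    # The usual reduction over the batch dimension:
    bias_grad_sum = tf.reduce_sum(grads, axis=0)

    # The same reduction written as a matmul with an all-ones vector,
    # which goes through a different kernel:
    ones = tf.ones([1, batch_size])
    bias_grad_matmul = tf.reshape(tf.matmul(ones, grads), [-1])  # shape [10]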

0

Source: https://habr.com/ru/post/1264619/

