I have noticed that my models turn out differently every time I train them, even though I keep the TensorFlow random seed fixed.
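For reference, this is roughly how the seed is fixed (a minimal sketch in the TF 1.x style that the `tf.gradients` usage below suggests; the seed value and placement are illustrative, not my exact script):

```python
import tensorflow as tf

# Graph-level seed, set right after resetting/building a fresh graph.
# The seed value itself is illustrative.
tf.reset_default_graph()
tf.set_random_seed(42)
```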
I checked that:
- Initialization is deterministic; the weights are identical before the first update.
- Inputs are deterministic. In fact, several forward computations, including the loss, are identical for the very first batch.
- Gradients for the first batch differ between runs. Specifically, I am comparing the outputs of `tf.gradients(loss, train_variables)`. While `loss` and `train_variables` have the same values, the gradients sometimes differ for some of the variables. The differences are quite significant (sometimes the sum of the absolute differences for a single variable's gradient is greater than 1); see the comparison sketch after this list.
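To illustrate the kind of comparison I am doing, here is a minimal self-contained sketch (TF 1.x; the toy linear model, the helper name `gradients_for_first_batch`, and the fake data stand in for my real network and first batch):

```python
import numpy as np
import tensorflow as tf

def gradients_for_first_batch(seed=42):
    """Build a small model in a fresh graph with a fixed seed and
    return the gradients for a fixed 'first batch'."""
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    # Fixed toy inputs/targets standing in for the first batch.
    x = tf.constant(np.random.RandomState(0).rand(8, 4), dtype=tf.float32)
    y = tf.constant(np.random.RandomState(1).rand(8, 1), dtype=tf.float32)
    w = tf.get_variable("w", shape=[4, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    grads = tf.gradients(loss, tf.trainable_variables())
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        return sess.run(grads)

# Two independent "runs": in my case the sum of absolute differences
# is sometimes greater than 1 for some variables.
for g1, g2 in zip(gradients_for_first_batch(), gradients_for_first_batch()):
    print(np.sum(np.abs(g1 - g2)))
```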
I conclude that it is the gradient computation that introduces the non-determinism. I have looked into this further, and the problem persists when running on the CPU with `intra_op_parallelism_threads=1` and `inter_op_parallelism_threads=1`.
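This is the session configuration used for that single-threaded CPU test (TF 1.x; disabling the GPU via `device_count` is my assumption about how one would pin the run to the CPU, not necessarily how my original script does it):

```python
import tensorflow as tf

# Single-threaded CPU execution: one thread within each op, one op
# executed at a time, and no GPU devices visible to the session.
config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1,
    device_count={"GPU": 0},
)
sess = tf.Session(config=config)
```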
How can the backward pass be non-deterministic when the forward pass is not? How can I debug this further?
Georg