Computation time increases in jumps every 20 thousand steps

I noticed that TensorFlow keeps slowing down during training. I plotted the computation time per sample over the course of training and saw that, after being roughly constant for the first 80 thousand steps, it keeps growing; moreover, it seems to follow a pattern in which the computation time jumps every 20 thousand steps and stays constant in between.

[Plot: computation time per sample over training steps]

After 400 thousand steps, the computation time had increased from 1.46 ms to 25.11 ms per sample, roughly a 17x increase, which is clearly unwelcome.

When I stop training and resume from the last saved checkpoint, the computation time per sample drops back to ~1.46 ms, so the slowdown does not come from the model itself.

Has anyone run into the same problem, and what could the reason be? (My next step will be to run without save/resume, to try to narrow the problem down at that level.)
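
For context, the per-sample timing described above is easy to reproduce with a plain wall-clock measurement around sess.run. The model below is a stand-in (a dummy linear regression), not the one from the question; only the timing pattern matters:

import time
import numpy as np
import tensorflow as tf  # TF 1.x API, as in the question

# Dummy stand-in model: a tiny linear layer, only used to illustrate
# how the per-sample computation time can be measured each step.
batch_size = 64
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for step in range(1000):
        xs = np.random.rand(batch_size, 10).astype(np.float32)
        ys = np.random.rand(batch_size, 1).astype(np.float32)
        start = time.time()
        sess.run(train_op, feed_dict={x: xs, y: ys})
        per_sample_ms = (time.time() - start) * 1000.0 / batch_size
        if step % 100 == 0:
            print("step %d: %.3f ms/sample" % (step, per_sample_ms))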

Update: with the summaries disabled, the computation time stays flat.

Update: the problem is caused by the summaries; tf.get_default_graph().finalize() is called before the training loop, so the slowdown is not due to new nodes being added to the graph.
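
A minimal sketch of that check, assuming a TF 1.x setup similar to the question's (the model here is a placeholder, not the original one): finalize the default graph only after everything, including summaries and the init op, has been built, so that any op accidentally created inside the training loop raises a RuntimeError instead of silently growing the graph.

import tensorflow as tf  # TF 1.x API

# Placeholder model, just to have something to summarize and train.
x = tf.placeholder(tf.float32, [None, 10])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
tf.summary.scalar("loss", loss)
smry_op = tf.summary.merge_all()
init_op = tf.global_variables_initializer()

# Freeze the graph: creating any new op from here on raises a RuntimeError,
# which rules out graph growth as the source of the slowdown.
tf.get_default_graph().finalize()

with tf.Session() as sess:
    sess.run(init_op)
    # training loop goes here, e.g.:
    # _, smry = sess.run([train_op, smry_op], feed_dict=...)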

Update and partial answer.

It seems that the inflation of the computation time comes from using a trace_level of tf.RunOptions.FULL_TRACE when evaluating my summaries.

Replacing

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
smry = sess.run(smry_op, feed_dict=feed_dict, options=run_options)

with

smry = sess.run(smry_op, feed_dict=feed_dict)

gets rid of the problem. Of course, the question of why FULL_TRACE incurs such a steadily growing overhead remains, so I leave the question open.
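
One possible workaround, if the full trace is still wanted for TensorBoard profiling (this is my own sketch, not something from the original post), is to request FULL_TRACE only every N summary evaluations and hand the resulting RunMetadata to the FileWriter, so the expensive tracing does not run on every step. The sess, smry_op, feed_dict and writer arguments are assumed to come from the question's training setup.

import tensorflow as tf  # TF 1.x API

def run_summary(sess, smry_op, feed_dict, writer, step, trace_every=10000):
    """Evaluate summaries; attach a FULL_TRACE only every `trace_every` steps."""
    if step % trace_every == 0:
        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        smry = sess.run(smry_op, feed_dict=feed_dict,
                        options=run_options, run_metadata=run_metadata)
        # The trace is written out once and not carried over to later steps.
        writer.add_run_metadata(run_metadata, "step_%d" % step, global_step=step)
    else:
        smry = sess.run(smry_op, feed_dict=feed_dict)
    writer.add_summary(smry, global_step=step)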

