How does "MonitoredTrainingSession ()" work with "recovery" and "test mode"?

In TensorFlow, we can create several TensorFlow sessions for distributed training using between-graph replication. MonitoredTrainingSession() coordinates multiple TensorFlow sessions, and MonitoredTrainingSession() has a checkpoint_dir argument for recovering a session/graph. Now I have the following questions:

  • Usually we use a tf.train.Saver() object and saver.restore(...) to restore a TensorFlow graph. How do we restore it when using MonitoredTrainingSession()?
  • Since we run several processes, each of which builds and runs a TensorFlow session for training, I wonder whether we also need to run several processes for testing (or prediction) after training. In other words, how does MonitoredTrainingSession() work in test (or prediction) mode?

I read the TensorFlow docs but did not find answers to these two questions. I would really appreciate it if anyone has a solution. Thanks!

2 answers

Short answer:

  • You need to pass the global step to the optimizer whose train op you run in mon_sess.run(). That is what lets checkpoints be saved and restored at consistent points.
  • MonitoredTrainingSession is meant for training. For testing (or prediction) you have two options. First, you can add the test tensors (e.g., a test_loss computed on held-out data) to the mon_sess.run() call and monitor them alongside training. Second, you can run a separate process after training that restores the graph from the checkpoint directory and runs inference; a sketch of this appears at the end of the long answer.

Long answer:

First of all, it helps to know that tf.train.MonitoredTrainingSession is a thin wrapper around tf.train.MonitoredSession (tf.train.MonitoredTrainingSession returns a tf.train.MonitoredSession configured with the appropriate hooks and session creator).

Suppose you want to save a checkpoint every 5 seconds to './ckpt_dir'. The training code would then look like this:

import tensorflow as tf

def train(inputs, labels_onehot, global_step):
    # No activation_fn here: sigmoid_cross_entropy_with_logits expects
    # pre-activation logits and applies the sigmoid itself.
    out = tf.contrib.layers.fully_connected(
                            inputs,
                            num_outputs=10,
                            activation_fn=None)
    loss = tf.reduce_mean(
             tf.reduce_sum(
                tf.nn.sigmoid_cross_entropy_with_logits(
                            logits=out,
                            labels=labels_onehot), axis=1))
    # Passing global_step makes the optimizer increment it on every step,
    # which the checkpoint machinery relies on.
    opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    train_op = opt.minimize(loss, global_step=global_step)
    return train_op

with tf.Graph().as_default():
    global_step = tf.train.get_or_create_global_step()
    inputs = ...
    labels_onehot = ...
    train_op = train(inputs, labels_onehot, global_step)

    with tf.train.MonitoredTrainingSession(
        checkpoint_dir='./ckpt_dir',
        save_checkpoint_secs=5,
        hooks=[ ... ] # Choose your hooks
    ) as mon_sess:
        while not mon_sess.should_stop():
            mon_sess.run(train_op)

When the session is created, MonitoredTrainingSession automatically restores the latest checkpoint found in checkpoint_dir, so no explicit saver.restore(...) call is needed.

Under the hood this is implemented with a tf.train.CheckpointSaverHook and a tf.train.ChiefSessionCreator, so the tf.train.MonitoredTrainingSession above is equivalent to:

checkpoint_dir = './ckpt_dir'

scaffold = tf.train.Scaffold()
saverhook = tf.train.CheckpointSaverHook(
    checkpoint_dir=checkpoint_dir,
    save_secs=5,
    scaffold=scaffold
)
session_creator = tf.train.ChiefSessionCreator(
    scaffold=scaffold,
    checkpoint_dir=checkpoint_dir
)

with tf.train.MonitoredSession(
        session_creator=session_creator,
        hooks=[saverhook]) as mon_sess:
    while not mon_sess.should_stop():
        mon_sess.run(train_op)

As for testing: you can fetch the quantities you want to monitor by adding them to the mon_sess.run() call inside the training loop (the while loop above):

mon_sess.run([train_op, cross_validation_loss])

Note that cross_validation_loss must be computed from a separate validation input pipeline; otherwise it is simply evaluated on the current training batch. If you want a standalone test (or prediction) phase instead, restore the trained variables from the checkpoint directory in a fresh session after training has finished.
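
Here is a minimal sketch of that standalone test phase. This is one possible wiring, not the only one: build_model, test_inputs, and test_labels are hypothetical placeholders, and only './ckpt_dir' comes from the training example above.

import tensorflow as tf

with tf.Graph().as_default():
    test_inputs = ...     # your test input pipeline
    test_labels = ...
    # build_model is a hypothetical helper that rebuilds exactly the same
    # network (same variable names and shapes) as the training graph.
    test_logits = build_model(test_inputs)
    test_loss = tf.reduce_mean(
        tf.reduce_sum(
            tf.nn.sigmoid_cross_entropy_with_logits(
                logits=test_logits,
                labels=test_labels), axis=1))

    saver = tf.train.Saver()
    with tf.Session() as sess:
        # Restore the variables that MonitoredTrainingSession saved.
        ckpt = tf.train.latest_checkpoint('./ckpt_dir')
        saver.restore(sess, ckpt)
        print(sess.run(test_loss))

Because checkpoints are plain variable snapshots, an ordinary tf.train.Saver in a fresh session can read what MonitoredTrainingSession wrote, as long as the test graph defines the same variables.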

  • Regarding recovery: the API docs note that MonitoredTrainingSession returns a MonitoredSession, which is described as a "Session-like object that handles initialization, recovery and hooks".

  • For prediction, take a look at how tf.contrib.learn.Estimator(..).predict(..) is implemented: it calls tf.contrib.learn.Estimator(..)._infer_model(..), which restores the model from its checkpoint and runs inference in a MonitoredSession. See the sketch below.
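
For illustration, a rough sketch of that Estimator route, assuming the legacy tf.contrib.learn API; model_fn and predict_input_fn are hypothetical and must recreate the same graph that was trained (only './ckpt_dir' matches the earlier examples):

import tensorflow as tf

def model_fn(features, labels, mode):
    # Hypothetical model_fn mirroring the network from the training example.
    logits = tf.contrib.layers.fully_connected(
        features, num_outputs=10, activation_fn=None)
    predictions = tf.nn.sigmoid(logits)
    loss = None
    train_op = None
    if mode != tf.contrib.learn.ModeKeys.INFER:
        loss = tf.reduce_mean(tf.reduce_sum(
            tf.nn.sigmoid_cross_entropy_with_logits(
                logits=logits, labels=labels), axis=1))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=tf.train.get_or_create_global_step())
    return tf.contrib.learn.ModelFnOps(
        mode=mode, predictions=predictions, loss=loss, train_op=train_op)

# Pointing model_dir at the training checkpoint_dir makes predict()
# restore the latest checkpoint before running inference.
est = tf.contrib.learn.Estimator(model_fn=model_fn, model_dir='./ckpt_dir')
predictions = est.predict(input_fn=predict_input_fn)  # predict_input_fn is hypothetical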


Source: https://habr.com/ru/post/1016151/
