How does "MonitoredTrainingSession ()" work with "recovery" and "test mode"?

In TensorFlow, we can create several TensorFlow sessions for distributed training using between-graph replication. MonitoredTrainingSession() coordinates multiple TensorFlow sessions, and MonitoredTrainingSession() has a checkpoint_dir argument for recovering a session/graph. Now I have the following questions:

  • Usually we use a tf.train.Saver() object and saver.restore(...) to restore a TensorFlow graph. How do we restore it when using MonitoredTrainingSession()?
  • Since we run several processes, each of which builds and runs a TensorFlow session for training, I wonder whether we also need to run several processes for testing (or prediction) after training. In other words, how does MonitoredTrainingSession() work in test (or prediction) mode?

I read the TensorFlow docs but did not find answers to these two questions. I would really appreciate it if anyone has a solution. Thanks!

2 answers

Short answer:

  • You need to pass the global step to the optimizer whose train op you run in mon_sess.run(). That is what lets checkpoints be saved and restored at consistent points.
  • MonitoredTrainingSession is meant for training. For testing (or prediction) you have two options. First, you can add the test tensors (e.g., a test_loss computed on held-out data) to the mon_sess.run() call and monitor them alongside training. Second, you can run a separate process after training that restores the graph from the checkpoint directory and runs inference; a sketch of this appears at the end of the long answer.

Long answer:

First of all, it helps to know that tf.train.MonitoredTrainingSession is a thin wrapper around tf.train.MonitoredSession (tf.train.MonitoredTrainingSession returns a tf.train.MonitoredSession configured with the appropriate hooks and session creator).

Suppose you want to save a checkpoint every 5 seconds to './ckpt_dir'. The training code would then look like this:

import tensorflow as tf

def train(inputs, labels_onehot, global_step):
    # No activation_fn here: sigmoid_cross_entropy_with_logits expects
    # pre-activation logits and applies the sigmoid itself.
    out = tf.contrib.layers.fully_connected(
                            inputs,
                            num_outputs=10,
                            activation_fn=None)
    loss = tf.reduce_mean(
             tf.reduce_sum(
                tf.nn.sigmoid_cross_entropy_with_logits(
                            logits=out,
                            labels=labels_onehot), axis=1))
    # Passing global_step makes the optimizer increment it on every step,
    # which the checkpoint machinery relies on.
    opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    train_op = opt.minimize(loss, global_step=global_step)
    return train_op

with tf.Graph().as_default():
    global_step = tf.train.get_or_create_global_step()
    inputs = ...
    labels_onehot = ...
    train_op = train(inputs, labels_onehot, global_step)

    with tf.train.MonitoredTrainingSession(
        checkpoint_dir='./ckpt_dir',
        save_checkpoint_secs=5,
        hooks=[ ... ] # Choose your hooks
    ) as mon_sess:
        while not mon_sess.should_stop():
            mon_sess.run(train_op)

When the session is created, MonitoredTrainingSession automatically restores the latest checkpoint found in checkpoint_dir, so no explicit saver.restore(...) call is needed.

Under the hood this is implemented with a tf.train.CheckpointSaverHook and a tf.train.ChiefSessionCreator, so the tf.train.MonitoredTrainingSession above is equivalent to:

checkpoint_dir = './ckpt_dir'

scaffold = tf.train.Scaffold()
saverhook = tf.train.CheckpointSaverHook(
    checkpoint_dir=checkpoint_dir,
    save_secs=5,
    scaffold=scaffold
)
session_creator = tf.train.ChiefSessionCreator(
    scaffold=scaffold,
    checkpoint_dir=checkpoint_dir
)

with tf.train.MonitoredSession(
        session_creator=session_creator,
        hooks=[saverhook]) as mon_sess:
    while not mon_sess.should_stop():
        mon_sess.run(train_op)

As for testing: you can fetch the quantities you want to monitor by adding them to the mon_sess.run() call inside the training loop (the while loop above):

mon_sess.run([train_op, cross_validation_loss])

Note that cross_validation_loss must be computed from a separate validation input pipeline; otherwise it is simply evaluated on the current training batch. If you want a standalone test (or prediction) phase instead, restore the trained variables from the checkpoint directory in a fresh session after training has finished.
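
Here is a minimal sketch of that standalone test phase. This is one possible wiring, not the only one: build_model, test_inputs, and test_labels are hypothetical placeholders, and only './ckpt_dir' comes from the training example above.

import tensorflow as tf

with tf.Graph().as_default():
    test_inputs = ...     # your test input pipeline
    test_labels = ...
    # build_model is a hypothetical helper that rebuilds exactly the same
    # network (same variable names and shapes) as the training graph.
    test_logits = build_model(test_inputs)
    test_loss = tf.reduce_mean(
        tf.reduce_sum(
            tf.nn.sigmoid_cross_entropy_with_logits(
                logits=test_logits,
                labels=test_labels), axis=1))

    saver = tf.train.Saver()
    with tf.Session() as sess:
        # Restore the variables that MonitoredTrainingSession saved.
        ckpt = tf.train.latest_checkpoint('./ckpt_dir')
        saver.restore(sess, ckpt)
        print(sess.run(test_loss))

Because checkpoints are plain variable snapshots, an ordinary tf.train.Saver in a fresh session can read what MonitoredTrainingSession wrote, as long as the test graph defines the same variables.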

  • Regarding recovery: the API docs note that MonitoredTrainingSession returns a MonitoredSession, which is described as a "Session-like object that handles initialization, recovery and hooks".

  • For prediction, take a look at how tf.contrib.learn.Estimator(..).predict(..) is implemented: it calls tf.contrib.learn.Estimator(..)._infer_model(..), which restores the model from its checkpoint and runs inference in a MonitoredSession. See the sketch below.
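
For illustration, a rough sketch of that Estimator route, assuming the legacy tf.contrib.learn API; model_fn and predict_input_fn are hypothetical and must recreate the same graph that was trained (only './ckpt_dir' matches the earlier examples):

import tensorflow as tf

def model_fn(features, labels, mode):
    # Hypothetical model_fn mirroring the network from the training example.
    logits = tf.contrib.layers.fully_connected(
        features, num_outputs=10, activation_fn=None)
    predictions = tf.nn.sigmoid(logits)
    loss = None
    train_op = None
    if mode != tf.contrib.learn.ModeKeys.INFER:
        loss = tf.reduce_mean(tf.reduce_sum(
            tf.nn.sigmoid_cross_entropy_with_logits(
                logits=logits, labels=labels), axis=1))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=tf.train.get_or_create_global_step())
    return tf.contrib.learn.ModelFnOps(
        mode=mode, predictions=predictions, loss=loss, train_op=train_op)

# Pointing model_dir at the training checkpoint_dir makes predict()
# restore the latest checkpoint before running inference.
est = tf.contrib.learn.Estimator(model_fn=model_fn, model_dir='./ckpt_dir')
predictions = est.predict(input_fn=predict_input_fn)  # predict_input_fn is hypothetical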


Source: https://habr.com/ru/post/1016151/
