Conflicting behavior in variations of using tf.cond with tf.nn.static_state_saving_rnn

Attachments: Model.txt, Training.txt

System Information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes, custom code
  • Platform and OS distribution (e.g. Linux Ubuntu 16.04) : Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use the command below) : 1.2 / 1.3 / 1.4 (tested on all)
  • Python version : 2.7
  • CUDA / cuDNN version : Cuda 8, CuDNN 6
  • GPU model and memory : GeForce GTX 1080, 12 GB.
  • Exact command to reproduce: python Training.py

Problem

I am working with long sequential data that has to be fed to an RNN. To perform truncated BPTT and batching, I use the tf.contrib.training.batch_sequences_with_states API together with tf.nn.static_state_saving_rnn, so that the RNN state is carried over to subsequent segments of the same sequence. I use tf.RandomShuffleQueue() to hold my data and to decouple input/output from training; the enqueue operations run asynchronously in a separate thread.
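Schematically, the input side looks like the sketch below. This is not the attached code: the sizes, the enqueue_thread helper and the random data are placeholders that only illustrate how the queue is fed from one thread and consumed by the training graph.

```python
import numpy as np
import tensorflow as tf

SEQ_LEN, FEATURE_DIM = 500, 40   # placeholder sizes, not the real ones

# Queue that decouples data generation from training.
input_queue = tf.RandomShuffleQueue(
    capacity=50,
    min_after_dequeue=10,
    dtypes=[tf.string, tf.float32, tf.int32],
    shapes=[[], [SEQ_LEN, FEATURE_DIM], [SEQ_LEN]])

key_ph = tf.placeholder(tf.string, [])
data_ph = tf.placeholder(tf.float32, [SEQ_LEN, FEATURE_DIM])
label_ph = tf.placeholder(tf.int32, [SEQ_LEN])
enqueue_op = input_queue.enqueue([key_ph, data_ph, label_ph])

def enqueue_thread(sess, coord):
    """Push whole sequences into the queue from a separate thread."""
    i = 0
    while not coord.should_stop():
        sess.run(enqueue_op, feed_dict={
            key_ph: 'seq_%d' % i,
            data_ph: np.random.randn(SEQ_LEN, FEATURE_DIM).astype(np.float32),
            label_ph: np.zeros(SEQ_LEN, np.int32)})
        i += 1

# The training graph dequeues one full sequence at a time; it is split into
# fixed-size segments by batch_sequences_with_states further down.
train_key, train_data, train_labels = input_queue.dequeue()
```

The thread is started with something like threading.Thread(target=enqueue_thread, args=(sess, coord)).start() before entering the training loop; an equivalent queue and thread exist for the test data.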

To be able to run evaluation after each training epoch, I use two separate tf.RandomShuffleQueue()s and, accordingly, two separate instances of tf.contrib.training.batch_sequences_with_states() and tf.nn.static_state_saving_rnn(), one for the training data and one for the test data. Only the RNN cell passed to the tf.nn.static_state_saving_rnn instances is shared, so the updated weights are used during testing.
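The duplicated part then looks roughly like this (a sketch only: NUM_UNITS, the helper names and the test-side tensors are assumptions, not the attached Model.py):

```python
NUM_UNITS, NUM_UNROLL, BATCH_SIZE = 128, 20, 16   # placeholder hyperparameters

cell = tf.contrib.rnn.LSTMCell(NUM_UNITS)         # the only shared object

def build_segment_batch(seq_key, seq_data, seq_labels, name):
    """One batch_sequences_with_states instance per train/test pipeline."""
    return tf.contrib.training.batch_sequences_with_states(
        input_key=seq_key,
        input_sequences={'data': seq_data, 'labels': seq_labels},
        input_context={},
        input_length=tf.shape(seq_data)[0],
        initial_states={'lstm_c': tf.zeros([NUM_UNITS]),
                        'lstm_h': tf.zeros([NUM_UNITS])},
        num_unroll=NUM_UNROLL,
        batch_size=BATCH_SIZE,
        name=name)

def build_rnn(batch, reuse):
    """One static_state_saving_rnn per pipeline; the cell weights are shared."""
    with tf.variable_scope('rnn', reuse=reuse):
        inputs = tf.unstack(batch.sequences['data'], num=NUM_UNROLL, axis=1)
        outputs, _ = tf.nn.static_state_saving_rnn(
            cell, inputs, state_saver=batch, state_name=('lstm_c', 'lstm_h'))
    return outputs

# train_* come from the training queue above; test_* from an analogous test queue.
train_batch = build_segment_batch(train_key, train_data, train_labels, 'train_batch')
test_batch = build_segment_batch(test_key, test_data, test_labels, 'test_batch')

train_outputs = build_rnn(train_batch, reuse=False)
test_outputs = build_rnn(test_batch, reuse=True)
```

The test-side queue, placeholders and enqueue thread are built the same way as the training ones; only cell is created once.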

In addition, I use a placeholder holding a boolean flag that switches the corresponding nodes of the computation graph between train and test time. This switch is implemented with tf.cond().
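Simplified, the switch is just this (is_training and the choice of tensor are illustrative, reusing the placeholder names from the sketches above):

```python
is_training = tf.placeholder(tf.bool, [], name='is_training')

# The rest of the graph (loss, predictions, summaries, ...) is built on top of
# rnn_out; which pipeline it comes from is decided by the flag fed at run time.
rnn_out = tf.cond(is_training,
                  lambda: train_outputs[-1],
                  lambda: test_outputs[-1])
```

During training I feed is_training=True; during the evaluation pass after each epoch I feed False.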

Situation 1

The problem is a deadlock that occurs at some point between the queue operations and the training operations, which run in separate threads. Enqueue timeouts most often occur because the queue has reached its maximum capacity, while for some reason the training op never returns, waiting for more data, and therefore the dequeue op is never called.

Situation 2

In the Model.py file, if I uncomment lines 97-101 and comment out line 104, the deadlock does not occur. The only difference is how a particular tf.cond() is written: one variant is in declarative form (working code), the other in embedded form (broken / deadlocking code).
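I cannot paste lines 97-104 of Model.py here, but the contrast I mean is roughly the one sketched below (names reuse the placeholder sketches above). In the declarative form both branch tensors are created outside tf.cond and the cond merely selects one of them; in the embedded form the branch graphs are created inside the callables, so, as far as I understand tf.cond, the ops of the branch that is not taken do not execute on that step, which may be what starves one of the queues.

```python
# Declarative form (roughly lines 97-101; this is the variant that works):
# both branches are ordinary tensors built outside tf.cond, so the
# dequeue/state-saving ops behind each of them run on every step and the
# cond only picks a value.
train_branch = train_outputs[-1]
test_branch = test_outputs[-1]
rnn_out = tf.cond(is_training, lambda: train_branch, lambda: test_branch)

# Embedded form (roughly line 104; this is the variant that deadlocks):
# the branch graphs are built inside the callables, so whichever branch is
# not taken never runs its dequeue/state-saving ops on that step.
# (Alternative to the declarative form above, not built in the same graph.)
rnn_out = tf.cond(
    is_training,
    lambda: build_rnn(train_batch, reuse=False)[-1],
    lambda: build_rnn(test_batch, reuse=True)[-1])
```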

Situation 3

In Training.py, if I change how the data is generated in gen_data() (swapping lines 43-48 for line 61), what is fed to tf.contrib.training.batch_sequences_with_states() changes accordingly, and with that change the behaviour described in Situations 1 and 2 changes as well.

In short, the exact way tf.cond() is written seems to interact with tf.nn.static_state_saving_rnn() in a way I do not understand.

Source code / logs

The attached files (renamed from .py to .txt so they can be uploaded) are:

  • Model.txt - the model definition; its inference() method builds the graph described above.
  • Training.txt - the training script containing the sess.run() loop.

To reproduce Situation 1, rename the attached files back to .py and run: python Training.py
