CNTK: how to create a sequence of tensors from one tensor?

I have a working TensorFlow seq2seq model that I use to caption images, and I would like to convert it to CNTK, but I am having trouble getting the input into the LSTM in the required format.

Here is what I do on my TensorFlow network:

 max_seq_length = 40
 embedding_size = 512

 self.x_img = tf.placeholder(tf.float32, [None, 2048])
 self.x_txt = tf.placeholder(tf.int32, [None, max_seq_length])
 self.y = tf.placeholder(tf.int32, [None, max_seq_length])

 with tf.device("/cpu:0"):
     image_embed_inputs = tf.layers.dense(inputs=self.x_img, units=embedding_size)
     image_embed_inputs = tf.reshape(image_embed_inputs, [-1, 1, embedding_size])
     image_embed_inputs = tf.contrib.layers.batch_norm(image_embed_inputs, center=True, scale=True,
                                                       is_training=is_training, scope='bn')
     text_embedding = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -init_scale, init_scale))
     text_embed_inputs = tf.nn.embedding_lookup(text_embedding, self.x_txt)
     inputs = tf.concat([image_embed_inputs, text_embed_inputs], 1)

I basically do this:

I take the last 2048-dimensional layer of a pretrained ResNet-50 as part of my input. I then embed it into a 512-dimensional space through a dense layer ( image_embed_inputs ).

At the same time, I have a 40-element sequence of text tokens ( x_txt ) that I embed into the same 512-dimensional space ( text_embedding / text_embed_inputs ).

Then I concatenate them into a [-1, 41, 512] tensor, which is the actual input to my LSTM. The first element ( [-1, 0, 512] ) is the image embedding, and the remaining 40 elements ( [-1, 1:41, 512] ) are the embeddings of the text tokens in my input sequence.
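To make the shapes concrete, here is a minimal NumPy sketch of that concatenation (batch size 3 and the zero tensors are illustrative placeholders, not the asker's real data): one image embedding is prepended to the 40 text-token embeddings, producing a 41-step sequence per example.

```python
import numpy as np

batch, max_seq_length, embedding_size = 3, 40, 512

image_embed = np.zeros((batch, 1, embedding_size))               # [-1, 1, 512]
text_embed = np.zeros((batch, max_seq_length, embedding_size))   # [-1, 40, 512]

# concatenate along the time axis: image step first, then 40 token steps
lstm_inputs = np.concatenate([image_embed, text_embed], axis=1)
print(lstm_inputs.shape)  # (3, 41, 512)
```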

This works and does what I need in TensorFlow. Now I would like to do something similar in CNTK. I have been looking at the seq2seq tutorial, but have not yet figured out how to set up the input for my CNTK LSTM.

I took the 2048-dimensional ResNet embedding, the 40-element sequence of input text tokens, and the 40-element sequence of text label tokens, and saved them in the CTF text format (concatenating the ResNet embedding and the input token sequence into one feature vector), so they can be read like this:
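For illustration, here is a hypothetical helper showing one way such a combined sample could look as a dense CTF text line: the 2048 ResNet values and the 40 token ids packed into a single 2088-dimensional `x` stream, with the 40 label ids in `y`. The field names match the reader below, but the exact file layout is an assumption, not the asker's actual data files.

```python
def to_ctf_line(resnet_vec, token_ids, label_ids):
    """Format one sample as a dense CTF text line: '|x v1 ... v2088 |y l1 ... l40'."""
    x = list(resnet_vec) + list(token_ids)   # 2048 + 40 = 2088 values
    x_part = " ".join("%g" % v for v in x)
    y_part = " ".join(str(t) for t in label_ids)
    return "|x %s |y %s" % (x_part, y_part)

# toy sample: constant image features, token id 7 everywhere
line = to_ctf_line([0.5] * 2048, [7] * 40, [7] * 40)
```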

 def create_reader(path, is_training, input_dim, label_dim):
     return MinibatchSource(CTFDeserializer(path, StreamDefs(
         features=StreamDef(field='x', shape=2088, is_sparse=True),
         labels=StreamDef(field='y', shape=40, is_sparse=False)
     )), randomize=is_training,
        max_sweeps=INFINITELY_REPEAT if is_training else 1)

What I would like to do during training/testing is take the features input tensor, split it into the 2048-dimensional ResNet embedding and the 40-element input token sequence, and then construct a CNTK sequence from the latter to feed into my network. So far, however, I have not been able to figure out how to do this. This is where I am:

 def lstm(input, embedding_dim, LSTM_dim, cell_dim, vocab_dim):
     x_image = C.slice(input, 0, 0, 2048)
     x_text = C.slice(input, 0, 2048, 2088)
     x_text_seq = sequence.input_variable(shape=[vocab_dim], is_sparse=False)
     # How do I get the data from x_text into x_text_seq?
     image_embedding = Embedding(embedding_dim)(x_image)
     text_embedding = Embedding(embedding_dim)(x_text_seq)
     lstm_input = C.splice(image_embedding, text_embedding)
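To clarify what the two slices are supposed to produce, here is a NumPy sketch of the intended data flow (the values are made up): the first 2048 entries of a sample are the dense image features, the last 40 are integer token ids, and each token id would need to become a vocab-dimensional vector so the 40 ids can form a 40-step sequence. This only mirrors the semantics of the `C.slice` calls; it is not CNTK API.

```python
import numpy as np

# one 2088-dimensional sample: 2048 image values followed by 40 token ids
sample = np.concatenate([np.full(2048, 0.1), np.arange(40)])

x_image = sample[:2048]                  # dense ResNet features
token_ids = sample[2048:].astype(int)    # 40 integer token ids

# one-hot each token id so every sequence step is a vocab_dim vector
vocab_dim = 1000
x_text_seq = np.eye(vocab_dim)[token_ids]  # shape (40, vocab_dim)
```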

I'm not sure how to set up the sequence correctly - any ideas?


Source: https://habr.com/ru/post/1258360/
