Working with variable-length text in TensorFlow

I am building a TensorFlow model to perform inference on text phrases. For simplicity, assume I need a classifier with a fixed number of output classes but variable-length text as input. In other words, my mini-batch would be a sequence of phrases, but not all phrases have the same length.

data = ['hello',
        'my name is Mark',
        'What is your name?']

My first preprocessing step was to build a dictionary of all the words in the dataset and map each word to an integer word id. The input becomes:

data = [[1],
        [2, 3, 4, 5],
        [6, 4, 7, 3]]
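
For reference, here is a minimal plain-Python sketch of that preprocessing step (my own illustration; the tokenization strips the '?' so that 'name?' maps to the same id as 'name'):

data = ['hello',
        'my name is Mark',
        'What is your name?']

vocab = {}
def word_id(word):
    # assign ids 1, 2, 3, ... in order of first appearance
    if word not in vocab:
        vocab[word] = len(vocab) + 1
    return vocab[word]

encoded = [[word_id(w) for w in phrase.replace('?', '').split()]
           for phrase in data]
print(encoded)  # [[1], [2, 3, 4, 5], [6, 4, 7, 3]]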

What is the best way to handle this kind of input? Can tf.placeholder() handle variable-size input within the same batch of data? Or do I need to pad all strings to the length of the longest one, using some placeholder id for the missing words? That seems very inefficient if some strings are much longer than most of the others.

- EDIT -

Here is a concrete example.

When I know the size of my datapoints in advance (and all datapoints have the same length, e.g. 3), I usually use something like:

input = tf.placeholder(tf.int32, shape=(None, 3))

with tf.Session() as sess:
  print(sess.run([...], feed_dict={input:[[1, 2, 3], [1, 2, 3]]}))

where the first placeholder dimension is the mini-batch size.

What if input sequences are words in sentences of different lengths?

feed_dict={input:[[1, 2, 3], [1]]}
+4

Here is a full worked example. TensorFlow has the pieces to handle variable-length text inside the graph (the code below uses the TF 1.x contrib APIs).

Start with a batch of raw lines (a string tensor of shape [3]):

import tensorflow as tf
lines = tf.constant([
    'Hello',
    'my name is also Mark',
    'Are there any other Marks here ?'])
vocabulary = ['Hello', 'my', 'name', 'is', 'also', 'Mark', 'Are', 'there', 'any', 'other', 'Marks', 'here', '?']

First, split the lines into words (note that the question mark is preceded by a space, so it comes out as its own token):

words = tf.string_split(lines," ")

This produces a SparseTensor with a dense shape of [3,7]. Its indices are pairs of [line number, word position]. The contents look like this:

indices    values
 0 0       'Hello'
 1 0       'my'
 1 1       'name'
 1 2       'is'
 ...

Now map each word to its vocabulary index with a lookup table:

table = tf.contrib.lookup.index_table_from_tensor(vocabulary)
word_indices = table.lookup(words)

This looks up each word's index while keeping the sparse structure intact.

From the sparse indices you can recover the length of each line by taking the maximum word position per line:

line_number = word_indices.indices[:,0]
line_position = word_indices.indices[:,1]
lengths = tf.segment_max(data = line_position, 
                         segment_ids = line_number)+1
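
With the three example lines this evaluates to [1, 5, 7].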

After all that processing you'll probably want to embed the words and run an LSTM over them. First convert the sparse ids to a dense tensor and apply an embedding layer (quick and dirty):

EMBEDDING_DIM = 100

dense_word_indices = tf.sparse_tensor_to_dense(word_indices)
e_layer = tf.contrib.keras.layers.Embedding(len(vocabulary), EMBEDDING_DIM)
embedded = e_layer(dense_word_indices)

The embedded tensor now has shape [3,7,100], arranged as [lines, words, embedding_dim]. (Note that tf.sparse_tensor_to_dense fills the padding with id 0, which here is also the id of 'Hello'; the sequence_length argument below keeps the LSTM from reading those positions anyway.)

Then build an LSTM:

LSTM_SIZE = 50
lstm = tf.nn.rnn_cell.BasicLSTMCell(LSTM_SIZE)

And run it over the batch, passing the true sequence lengths so the padded tail of each line is ignored:

outputs, final_state = tf.nn.dynamic_rnn(
    cell=lstm,
    inputs=embedded,
    sequence_length=lengths,
    dtype=tf.float32)

Now output has shape [3,7,50], or [line, word, lstm_size]. If you want to grab the state at the last word of each line, you can use the (hidden! experimental!) select_last_activations function:

from tensorflow.contrib.learn.python.learn.estimators.rnn_common import select_last_activations
final_output = select_last_activations(outputs,tf.cast(lengths,tf.int32))

That does all the index shuffling needed to select the output from each line's last timestep. The result has shape [3,50], or [line, lstm_size]:

init_t = tf.tables_initializer()
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init_t.run()
    init.run()
    print(final_output.eval().shape)  # (3, 50)

I haven't worked through the details yet, but I think this whole thing could probably be replaced by a single tf.contrib.learn.DynamicRnnEstimator.

+1

First, you can pad every sentence to some fixed maximum length (for example, 32 words), adding a special NULL token for the missing positions; the vocabulary gets a NULL word with its own id. A sentence like "Hi, how are you?" becomes "Hi, how are you? NULL NULL NULL NULL... NULL". Given enough training examples, the model can learn that NULL carries no information and ignore it.
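
A minimal sketch of that padding step (plain Python; MAX_LEN and NULL_ID are illustrative assumptions, not fixed by anything above):

MAX_LEN = 32   # assumed maximum sentence length
NULL_ID = 0    # assumed vocabulary id of the NULL token

def pad(ids):
    # truncate overlong sentences, pad short ones with NULL_ID
    ids = ids[:MAX_LEN]
    return ids + [NULL_ID] * (MAX_LEN - len(ids))

batch = [pad(s) for s in [[1], [2, 3, 4, 5], [6, 4, 7, 3]]]
# every row in batch now has length 32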

Padding has obvious downsides: the model wastes computation on the NULL tokens (which it also has to learn to ignore), and a few very long sentences force everything else to be padded a lot. Fortunately, if you use a recurrent network, tf.dynamic_rnn helps: you pass the true length of each sequence, and the computation simply stops at the end of each real sentence, so the padding costs almost nothing.

Also take a look at how the TensorFlow Seq2Seq tutorial handles this with bucketing: sentences are grouped into buckets of similar length, and each sentence is padded only up to the size of its bucket, which keeps the amount of padding small.
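
A rough sketch of the bucketing idea (the bucket boundaries are arbitrary values of my own, not the tutorial's):

# group sentences into buckets of similar length so each bucket
# needs only a little padding
buckets = [5, 10, 20, 40]

def bucket_for(ids):
    # smallest bucket the sentence fits into
    for size in buckets:
        if len(ids) <= size:
            return size
    return buckets[-1]   # overlong sentences: truncate or drop

grouped = {}
for s in [[1], [2, 3, 4, 5], [6, 4, 7, 3]]:
    grouped.setdefault(bucket_for(s), []).append(s)
# pad each sentence only up to its bucket's size before batching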

0

Here is what I ended up doing (I'm not sure it's 100% the right way, but it works for me):

In your vocab dict, where every word is mapped to a unique integer id, reserve one extra id, say K, for a special "<PAD>" token (or whatever symbol you prefer to use for padding).
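
For example (a small sketch with an illustrative word list):

words = ['hello', 'my', 'name', 'is', 'Mark', 'What', 'your']
vocab = {w: i for i, w in enumerate(words)}  # ids 0 .. K-1
vocab['<PAD>'] = len(vocab)                  # id K reserved for padding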

The input placeholder then looks like this:

x_batch = tf.placeholder(tf.int32, shape=(batch_size, None))

where None stands for the maximum phrase/sentence length in the mini-batch.
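
For instance, a mini-batch padded to its longest phrase (l_max=3 here) could be fed like this, using the <PAD> id from the sketch above:

K = vocab['<PAD>']   # padding id
feed_dict = {x_batch: [[1, K, K],
                       [1, 2, 3],
                       [4, 5, K]]}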

In addition, I keep a list with the true length of each phrase in the mini-batch. For example:

x_batch = [[1], [1,2,3], [4,5]] gives len_batch = [1, 3, 2]

Then, using len_batch and the maximum phrase length in the mini-batch (l_max), I create a binary mask. For l_max=3 above, the mask is:

mask = [
[1, 0, 0],
[1, 1, 1],
[1, 1, 0]
]

Multiplying this mask elementwise with your per-token outputs or loss cancels out whatever the model produced at the padding positions.

When averaging, divide by the true lengths (or the mask sum) rather than by l_max, so the padding doesn't dilute the result.
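
As a sketch of that masking in TensorFlow (tf.sequence_mask builds the same 0/1 mask shown above; per_token_loss is a stand-in for whatever your model actually produces):

import tensorflow as tf

batch_size = 3
x_batch = tf.placeholder(tf.int32, shape=(batch_size, None))
len_batch = tf.placeholder(tf.int32, shape=(batch_size,))
# stand-in for the per-token loss of your model, shape [batch_size, l_max]
per_token_loss = tf.placeholder(tf.float32, shape=(batch_size, None))

# the same 0/1 mask as above, built from the lengths
mask = tf.sequence_mask(len_batch, maxlen=tf.shape(x_batch)[1],
                        dtype=tf.float32)
masked = per_token_loss * mask                       # zero out padding positions
loss = tf.reduce_sum(masked) / tf.reduce_sum(mask)   # average over real tokens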

0

Do you actually need the word order? (If not, a recurrent model may be overkill for your task.) You can get a fixed-size representation with a simple BOW (bag of words) pipeline:

  • Feed the raw phrases in as tf.string.
  • Split them into words with tf.string_split.
  • Map the words to ids with tf.contrib.lookup.string_to_index_table_from_file or tf.contrib.lookup.string_to_index_table_from_tensor. Don't forget to initialize the table.
  • Look up the embeddings for those ids:
    word_embeddings = tf.get_variable("word_embeddings",
                                      [vocabulary_size, embedding_size])
    embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)
  • Combine the embeddings, and you get a fixed-length tensor (of size = embedding size). Sum is the obvious way to combine them, but you can pick a different reduction (avg, max or something else). See the full sketch below.
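
Putting those steps together, here is a minimal runnable sketch (my own assembly of the pieces above, with a tiny in-memory vocabulary; tf.nn.embedding_lookup_sparse does the lookup and the sum in one step, so no padding is needed at all):

import tensorflow as tf

lines = tf.constant(['hello', 'my name is Mark', 'What is your name'])
vocabulary = ['hello', 'my', 'name', 'is', 'Mark', 'What', 'your']
EMBEDDING_SIZE = 50

words = tf.string_split(lines)                    # SparseTensor of tokens
table = tf.contrib.lookup.index_table_from_tensor(vocabulary)
word_ids = table.lookup(words)                    # SparseTensor of word ids

word_embeddings = tf.get_variable(
    "word_embeddings", [len(vocabulary), EMBEDDING_SIZE])
# sum the embeddings of each line's words: one fixed-size vector per line
bow = tf.nn.embedding_lookup_sparse(word_embeddings, word_ids, None,
                                    combiner='sum')

with tf.Session() as sess:
    tf.tables_initializer().run()
    tf.global_variables_initializer().run()
    print(sess.run(bow).shape)                    # (3, 50)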

Maybe it's too late :) Good luck.
