How to deal with a large (> 2 GB) embedding lookup table in TensorFlow?

When using pre-trained word vectors for classification with an LSTM, I wondered how to handle an embedding lookup table larger than 2 GB in TensorFlow.

To do this, I tried to build the embedding lookup with the code below,

data = tf.nn.embedding_lookup(vector_array, input_data)

and got this ValueError:

ValueError: Cannot create a tensor proto whose content is larger than 2GB

The vector_array variable in the code is a NumPy array; it contains about 14 million unique tokens and a fixed-dimension word vector for each token.
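For a sense of scale, here is a rough size estimate. Passing a NumPy array straight to tf.nn.embedding_lookup turns it into a constant stored inside the graph definition, and that serialized form is limited to about 2 GB. The dimension of 100 and the float32 dtype below are assumptions for illustration only; the question does not state them.

```python
# Rough size estimate for the embedding matrix described in the question.
NUM_TOKENS = 14_000_000   # "about 14 million unique tokens" (from the question)
EMBEDDING_DIM = 100       # ASSUMPTION: the dimension is not given in the question
BYTES_PER_FLOAT32 = 4     # ASSUMPTION: vectors are stored as float32

PROTO_LIMIT_BYTES = 2 ** 31  # approximate 2 GB limit on a serialized tensor proto

matrix_bytes = NUM_TOKENS * EMBEDDING_DIM * BYTES_PER_FLOAT32
exceeds_limit = matrix_bytes > PROTO_LIMIT_BYTES

print("matrix size in bytes:", matrix_bytes)
print("exceeds the proto limit:", exceeds_limit)
```

Under these assumptions the matrix is well over the limit, so it cannot be baked into the graph as a constant and has to be fed in at run time instead.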

Thank you for your help.

+5
2 answers

For me, the accepted answer does not seem to work. Although there were no errors, the results were terrible (compared to a smaller embedding loaded via direct initialization), and I suspect the embeddings were just the constant 0 that the tf.Variable() is initialized with.

Using only a placeholder, without an additional variable,

 self.Wembed = tf.placeholder( tf.float32, self.embeddings.shape, name='Wembed') 

and then feeding the embeddings on every session.run() call of the graph seems to work.
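A minimal sketch of this placeholder-only approach, assuming the TF 1.x-style API (imported here through tensorflow.compat.v1). The small random matrix and the token ids are hypothetical stand-ins for the real pre-trained embeddings and input batch.

```python
import numpy as np
import tensorflow.compat.v1 as tf

# Hypothetical small stand-in for the large pre-trained embedding matrix.
embeddings = np.random.rand(1000, 8).astype(np.float32)

graph = tf.Graph()
with graph.as_default():
    # The matrix enters the graph through a placeholder, so it is never
    # serialized into the graph definition as a constant.
    Wembed = tf.placeholder(tf.float32, embeddings.shape, name='Wembed')
    input_data = tf.placeholder(tf.int32, [None, None], name='input_data')
    data = tf.nn.embedding_lookup(Wembed, input_data)

# Hypothetical batch of token ids: 2 sequences of length 3.
token_ids = np.array([[1, 5, 7], [2, 0, 999]], dtype=np.int32)

with tf.Session(graph=graph) as sess:
    # The embedding matrix has to be supplied on every run() call.
    looked_up = sess.run(
        data, feed_dict={Wembed: embeddings, input_data: token_ids})

print(looked_up.shape)
```

The trade-off is that the whole matrix is passed in with every run() call, which is the cost of not keeping it in a variable.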

+2

You need to copy it into a tf.Variable. There's a great answer to this question on StackOverflow: Using a pre-trained word embedding (word2vec or GloVe) in TensorFlow

Here is how I did it:

 embedding_weights = tf.Variable(
     tf.constant(0.0, shape=[embedding_vocab_size, EMBEDDING_DIM]),
     trainable=False, name="embedding_weights")
 embedding_placeholder = tf.placeholder(tf.float32, [embedding_vocab_size, EMBEDDING_DIM])
 embedding_init = embedding_weights.assign(embedding_placeholder)
 sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
 sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})

Then you can use the embedding_weights variable to do the lookup (don't forget to keep the word-to-index mapping).
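A rough sketch of the lookup step with a word-to-index mapping, again assuming the TF 1.x-style API via tensorflow.compat.v1. The toy vocabulary, the "<unk>" fallback token and the 3x4 matrix are hypothetical and only serve the illustration.

```python
import numpy as np
import tensorflow.compat.v1 as tf

# Hypothetical toy vocabulary; row i of embedding_matrix belongs to index i.
word_to_index = {"<unk>": 0, "hello": 1, "world": 2}
embedding_matrix = np.arange(12, dtype=np.float32).reshape(3, 4)

graph = tf.Graph()
with graph.as_default():
    embedding_weights = tf.Variable(
        tf.constant(0.0, shape=embedding_matrix.shape),
        trainable=False, name="embedding_weights")
    embedding_placeholder = tf.placeholder(tf.float32, embedding_matrix.shape)
    embedding_init = embedding_weights.assign(embedding_placeholder)

    input_data = tf.placeholder(tf.int32, [None])
    data = tf.nn.embedding_lookup(embedding_weights, input_data)

# Map words to row indices, falling back to "<unk>" for unknown words.
sentence = ["hello", "there", "world"]
ids = [word_to_index.get(w, word_to_index["<unk>"]) for w in sentence]

with tf.Session(graph=graph) as sess:
    # Copy the matrix into the variable once; later runs only feed the ids.
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})
    vectors = sess.run(data, feed_dict={input_data: ids})

print(vectors)
```

One thing worth checking in a real model: if a global variable initializer is run after embedding_init, the variable goes back to its initial zeros. That may be one explanation for all-zero embeddings, but it is only a guess.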

Update: using a variable is not required, but it lets you save the embeddings for later use so that you do not have to do all this again (loading very large embeddings takes a while on my laptop). If that is not important, you can simply use placeholders, as Nicklas Schnelle suggested.

+7

Source: https://habr.com/ru/post/1272565/

