How should I feed a text sequence into an LSTM in Keras using pad_sequences?

I coded the sequence preparation for LSTM training in Keras myself, using knowledge from online tutorials and my own intuition. I converted a sample text into a sequence and then padded it with Keras's pad_sequences function.

    from keras.preprocessing.text import Tokenizer, base_filter
    from keras.preprocessing.sequence import pad_sequences

    def shift(seq, n):
        n = n % len(seq)
        return seq[n:] + seq[:n]

    txt = "abcdefghijklmn" * 100

    tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
    tk.fit_on_texts(txt)
    x = tk.texts_to_sequences(txt)

    # shifting to the left
    y = shift(x, 1)

    # padding the sequences
    max_len = 100
    max_features = len(tk.word_counts)
    X = pad_sequences(x, maxlen=max_len)
    Y = pad_sequences(y, maxlen=max_len)

After a careful check, I found that my padded sequences look like this:

    >>> X[0:6]
    array([[0, 0, 0, ..., 0, 0, 1],
           [0, 0, 0, ..., 0, 0, 3],
           [0, 0, 0, ..., 0, 0, 2],
           [0, 0, 0, ..., 0, 0, 5],
           [0, 0, 0, ..., 0, 0, 4],
           [0, 0, 0, ..., 0, 0, 7]], dtype=int32)
    >>> X
    array([[ 0,  0,  0, ...,  0,  0,  1],
           [ 0,  0,  0, ...,  0,  0,  3],
           [ 0,  0,  0, ...,  0,  0,  2],
           ...,
           [ 0,  0,  0, ...,  0,  0, 13],
           [ 0,  0,  0, ...,  0,  0, 12],
           [ 0,  0,  0, ...,  0,  0, 14]], dtype=int32)

Is the padded sequence supposed to look like this? Every column in the array except the last one is all zeros. I think I made a mistake in tokenizing the text before padding, and if so, can you tell me where the mistake is?

2 answers

If you want to tokenize at the character level, you can do it manually; it is not too complicated:

First create a dictionary for your characters:

 txt="abcdefghijklmn"*100 vocab_char = {k: (v+1) for k, v in zip(set(txt), range(len(set(txt))))} vocab_char['<PAD>'] = 0 

This binds a distinct number to each character in your txt. Index 0 must be reserved for padding. Note that iterating over a set is unordered, so the exact character-to-index mapping can vary between runs.

Having a reverse dictionary will be useful for decoding output.

    rvocab = {v: k for k, v in vocab_char.items()}
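
For example, to turn a padded sequence of indices back into text (a small sketch, where seq stands for any list of indices), skip the padding index:

    # drop the padding index 0 and map the remaining indices back to characters
    decoded = ''.join(rvocab[i] for i in seq if i != 0)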

After that, you can break the text into sequences; say you want sequences of length seq_len = 13:

    seq_len = 13
    sequences = [[vocab_char[char] for char in txt[i:(i + seq_len)]]
                 for i in range(0, len(txt), seq_len)]

Your output will look like this:

    [[9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4, 3],
     [14, 9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4],
     ...,
     [2, 1, 5, 13, 11, 4, 3, 14, 9, 12, 6, 10, 8],
     [7, 2, 1, 5, 13, 11, 4, 3, 14]]

Note that the last sequence is shorter than the others. You can either drop it or pad your sequences up to max_len = 13, which will add 0s to them.
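
If you choose padding, a minimal sketch using Keras's pad_sequences on the sequences list built above:

    from keras.preprocessing.sequence import pad_sequences

    # by default, zeros (our '<PAD>' index) are prepended ('pre' padding)
    X = pad_sequences(sequences, maxlen=13)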

You can build your Y targets the same way, by shifting everything by 1. :-)
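
For instance, one way to do that (a sketch reusing the names above) is to shift the whole text one character to the left before splitting it:

    # shift the text left by one character, then split exactly as before
    txt_y = txt[1:] + txt[0]
    targets = [[vocab_char[char] for char in txt_y[i:(i + seq_len)]]
               for i in range(0, len(txt_y), seq_len)]
    Y = pad_sequences(targets, maxlen=13)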

Hope this helps.


The problem is in this line:

 tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ") 

When you set the split this way (by " "), then, because of the nature of your data (there are no spaces to split on), every sequence you get consists of a single word. That is why your padded sequences have only one nonzero element. To change this, try:

 txt="abcdefghijklmn "*100 
