Get dataset length in Tensorflow

import tensorflow as tf

# start_token, end_token and NUM_SAMPLES are defined elsewhere.
source_dataset = tf.data.TextLineDataset('primary.csv')
target_dataset = tf.data.TextLineDataset('secondary.csv')
dataset = tf.data.Dataset.zip((source_dataset, target_dataset))
dataset = dataset.shard(10000, 0)
dataset = dataset.map(lambda source, target: (tf.string_to_number(tf.string_split([source], delimiter=',').values, tf.int32),
                                              tf.string_to_number(tf.string_split([target], delimiter=',').values, tf.int32)))
dataset = dataset.map(lambda source, target: (source, tf.concat(([start_token], target), axis=0), tf.concat((target, [end_token]), axis=0)))
dataset = dataset.map(lambda source, target_in, target_out: (source, tf.size(source), target_in, target_out, tf.size(target_in)))

dataset = dataset.shuffle(NUM_SAMPLES)  # This is the important line of code

I would like to shuffle my entire dataset completely, but shuffle() requires a number of samples to pull (buffer_size), and tf.size() does not work with a tf.data.Dataset.

How to shuffle?
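For reference, a minimal sketch of why the length is not directly available from the dataset object: newer TensorFlow versions expose tf.data.experimental.cardinality(), but for file-based datasets such as TextLineDataset it reports UNKNOWN_CARDINALITY, so the element count still has to come from outside the pipeline.

import tensorflow as tf

# Sketch, assuming a recent TF release: cardinality() returns the length only
# when it is statically known; for a TextLineDataset it is UNKNOWN_CARDINALITY.
ds = tf.data.TextLineDataset('primary.csv')
print(tf.data.experimental.cardinality(ds))  # -2 == UNKNOWN_CARDINALITY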

1 answer

I was working with tf.data.FixedLengthRecordDataset() and ran into a similar problem. In my case, I was trying to take only a certain percentage of the raw data. Since I knew that all records have a fixed length, a workaround for me was:

import os
import tensorflow as tf
# filepath, filenames, percentage, bytesPerRecord and recordBytes are defined elsewhere.
totalBytes = sum(os.path.getsize(os.path.join(filepath, filename)) for filename in os.listdir(filepath))
numRecordsToTake = tf.cast(0.01 * percentage * totalBytes / bytesPerRecord, tf.int64)
dataset = tf.data.FixedLengthRecordDataset(filenames, recordBytes).take(numRecordsToTake)

In your case, you can use plain Python to count the number of lines in 'primary.csv' and 'secondary.csv' and pass that count as buffer_size. A buffer_size at least as large as the number of elements in the dataset gives a complete shuffle; a smaller buffer_size uses less memory but only shuffles within a window of that many elements.
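A minimal sketch of that suggestion, applied to the pipeline from the question (count_lines is a hypothetical helper; dataset is the zipped and mapped dataset built above):

import tensorflow as tf

def count_lines(path):
    # One record per line in the CSV, so the line count is the sample count.
    with open(path) as f:
        return sum(1 for _ in f)

num_samples = count_lines('primary.csv')
# A buffer at least as large as the dataset gives a full, uniform shuffle.
# (With the shard() above, the actual element count is smaller, but an
# oversized buffer still shuffles the whole dataset.)
dataset = dataset.shuffle(buffer_size=num_samples)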


Source: https://habr.com/ru/post/1690501/

