The difference between tf.data.Dataset.map() and tf.data.Dataset.apply()

When upgrading to version 1.4, TensorFlow included tf.data in the library core. One major new feature described in the 1.4 release notes is tf.data.Dataset.apply(), a "method for applying custom transformation functions". How does this differ from the existing tf.data.Dataset.map()?

+5
3 answers

The difference is that map will apply a function to each element of the Dataset separately, whereas apply will apply a function to the Dataset as a whole, once (for example, group_by_window, which is given as an example in the documentation).

The argument to apply is a function that takes a Dataset and returns a Dataset, whereas the argument to map is a function that takes one element and returns one transformed element.
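
To make the contrast concrete, here is a minimal sketch (drop_odd is a made-up whole-dataset transformation for illustration, not a library function):

 import tensorflow as tf

 dataset = tf.data.Dataset.range(10)

 # map: the function receives a single element and returns a single
 # transformed element.
 doubled = dataset.map(lambda x: x * 2)

 # apply: the function receives the whole Dataset and returns a new Dataset.
 def drop_odd(ds):
     return ds.filter(lambda x: tf.equal(x % 2, 0))

 evens = dataset.apply(drop_odd)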

+8

Sunreef's answer is absolutely correct. You might still wonder why we introduced Dataset.apply(), so I thought I'd offer some background.

The tf.data API contains a set of core transformations, such as Dataset.map() and Dataset.filter(), that are generally useful across a wide range of datasets, unlikely to change, and implemented as methods on the tf.data.Dataset object. In particular, they are subject to the same backward compatibility guarantees as other core TensorFlow APIs.
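
For illustration, a minimal chain of core transformations (the specific operations here are arbitrary):

 import tensorflow as tf

 # Core transformations are methods on Dataset, so they chain naturally.
 dataset = (tf.data.Dataset.range(100)
            .map(lambda x: x * x)
            .filter(lambda x: x < 1000))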

However, the core approach is a bit limiting. We also want the freedom to experiment with new transformations before adding them to the core, and to let other library developers create their own reusable transformations. Therefore, in TensorFlow 1.4 we split out a set of custom transformations that live in tf.contrib.data. The custom transformations include some that have very specific functionality (e.g. tf.contrib.data.sloppy_interleave()) and some where the API is still in flux (e.g. tf.contrib.data.group_by_window()). Originally we implemented these custom transformations as functions from Dataset to Dataset, which had an unfortunate effect on the syntactic flow of a pipeline. For example:

 dataset = tf.data.TFRecordDataset(...).map(...)

 # Method chaining breaks when we apply a custom transformation.
 dataset = custom_transformation(dataset, x, y, z)

 dataset = dataset.shuffle(...).repeat(...).batch(...)

Since this turned out to be a common pattern, we added Dataset.apply() as a way to chain core and custom transformations in a single pipeline:

 dataset = (tf.data.TFRecordDataset(...)
            .map(...)
            .apply(custom_transformation(x, y, z))
            .shuffle(...)
            .repeat(...)
            .batch(...))
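
As a concrete usage example, the group_by_window() transformation mentioned earlier plugs into apply() like this (a sketch against the TF 1.4 tf.contrib.data API; in later releases this transformation moved out of contrib):

 import tensorflow as tf

 dataset = tf.data.Dataset.range(10)

 # Group elements by parity; each window of up to 4 same-key elements
 # is batched together.
 dataset = dataset.apply(tf.contrib.data.group_by_window(
     key_func=lambda x: x % 2,                       # int64 key per element
     reduce_func=lambda key, window: window.batch(4),
     window_size=4))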

This is a minor feature in the grand scheme of things, but I hope it makes tf.data programs easier to read and the library easier to extend.

+5

I don't have enough reputation to comment, but I just wanted to point out that you can use map to apply a function to multiple elements of a dataset, contrary to @Sunreef's comments on his own post.

According to the documentation, map takes as an argument:

map_func: A function that maps a nested structure of tensors (having shapes and types defined by self.output_shapes and self.output_types) to another nested structure of tensors.

The output_shapes are defined by the dataset and can be changed by API functions such as batch. So, for example, you can normalize an entire batch using only dataset.batch and dataset.map:

 dataset = dataset ...
 dataset = dataset.batch(batch_size)   # elements become whole batches
 dataset = dataset.map(normalize_fn)   # normalize_fn now sees a full batch
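
For concreteness, a runnable sketch (normalize_fn and the toy data here are assumptions for illustration):

 import numpy as np
 import tensorflow as tf

 # Hypothetical per-batch normalization: standardize each feature
 # across the batch dimension.
 def normalize_fn(batch):
     mean, variance = tf.nn.moments(batch, axes=[0])
     return (batch - mean) / tf.sqrt(variance + 1e-8)

 dataset = tf.data.Dataset.from_tensor_slices(
     np.random.rand(100, 3).astype(np.float32))
 dataset = dataset.batch(10)           # each element is now a (10, 3) batch
 dataset = dataset.map(normalize_fn)   # the function operates on the batch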

It seems that the main utility of apply() is when you really want to perform a transformation across the entire dataset.

+1

Source: https://habr.com/ru/post/1273100/

