Oversampling functionality in the TensorFlow Dataset API

I would like to ask whether the current Dataset API allows implementing an oversampling algorithm. I am dealing with an imbalanced-class problem. I thought it would be nice to oversample specific classes while parsing the dataset, i.e. online generation. I have seen the implementation of the rejection_resample function; however, it removes samples instead of duplicating them, and it also slows down batch generation (when the target distribution is very different from the initial one). What I would like to achieve is: take an example, look at its class probability, and decide whether to duplicate it or not; then call dataset.shuffle(...) and dataset.batch(...) and get an iterator. The best (in my opinion) approach would be to oversample the low-probability classes and undersample the most probable ones. I would like to do it online, since that is more flexible.

3 answers

This particular problem has been solved in issue #14451. Just posting the answer here to make it more visible to other developers.

The code example below oversamples low-frequency classes and undersamples high-frequency ones, where class_target_prob is just a uniform distribution in my case. I wanted to check some conclusions from a recent manuscript, A systematic study of the class imbalance problem in convolutional neural networks.
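The snippet below assumes that every example is a dict that already carries a 'class_prob' feature (the empirical probability of its class) and a 'class_target_prob' feature (the desired probability). A minimal sketch of how these fields could be attached beforehand, assuming examples with an integer 'label' key and a precomputed class_probs vector (the key name and the numbers are placeholders, not part of the original answer):

import tensorflow as tf

num_classes = 4
class_probs = tf.constant([0.7, 0.1, 0.1, 0.1], dtype=tf.float32)         # empirical distribution (placeholder values)
class_target_probs = tf.constant([0.25] * num_classes, dtype=tf.float32)  # uniform target, as in this answer

def add_sampling_info(example):
    # look up the probabilities for this example's class and store them as extra features
    example['class_prob'] = tf.gather(class_probs, example['label'])
    example['class_target_prob'] = tf.gather(class_target_probs, example['label'])
    return example

dataset = dataset.map(add_sampling_info)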

Oversampling of certain classes is done by calling:

dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)

Here is the complete snippet that does everything:

import tensorflow as tf

# sampling parameters
oversampling_coef = 0.9  # if equal to 0 then oversample_classes() always returns 1
undersampling_coef = 0.5  # if equal to 0 then undersampling_filter() always returns True

def oversample_classes(example):
    """
    Returns the number of copies of given example
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob/class_prob, dtype=tf.float32)
    # soften the ratio; if oversampling_coef==0 we recover the original distribution
    prob_ratio = prob_ratio ** oversampling_coef 
    # for classes with probability higher than class_target_prob we
    # want to return 1
    prob_ratio = tf.maximum(prob_ratio, 1) 
    # for low probability classes this number will be very large
    repeat_count = tf.floor(prob_ratio)
    # prob_ratio can be e.g. 1.9, which means that there is still a 90%
    # chance that we should return 2 instead of 1
    repeat_residual = prob_ratio - repeat_count # a number between 0-1
    residual_acceptance = tf.less_equal(
        tf.random_uniform([], dtype=tf.float32), repeat_residual
    )

    residual_acceptance = tf.cast(residual_acceptance, tf.int64)
    repeat_count = tf.cast(repeat_count, dtype=tf.int64)

    return repeat_count + residual_acceptance


def undersampling_filter(example):
    """
    Computes if given example is rejected or not.
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob/class_prob, dtype=tf.float32)
    prob_ratio = prob_ratio ** undersampling_coef
    prob_ratio = tf.minimum(prob_ratio, 1.0)

    acceptance = tf.less_equal(tf.random_uniform([], dtype=tf.float32), prob_ratio)

    return acceptance


dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)

dataset = dataset.filter(undersampling_filter)

dataset = dataset.repeat(-1)
dataset = dataset.shuffle(2048)
dataset = dataset.batch(32)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
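A minimal sketch of consuming the pipeline in a TF1 session loop (the number of steps and the printed key are arbitrary placeholders):

# fetch a few resampled batches; next_element is a dict of batched tensors
for _ in range(10):
    batch = sess.run(next_element)
    print(batch['class_prob'])  # inspect the realized class mix of this batch

Since the dataset repeats indefinitely (repeat(-1)), no OutOfRangeError handling is needed here.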

Update #1

Here is a simple Jupyter notebook that implements the above oversampling/undersampling on a toy model.


tf.data.experimental.rejection_resample now seems a more convenient way to do this, since it does not require extra "class_prob" and "class_target_prob" features on each example.
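For reference, a minimal sketch of applying that transformation in TF 1.x, assuming a dataset of (features, label) elements; the target and initial distributions shown are placeholder numbers:

resampler = tf.data.experimental.rejection_resample(
    class_func=lambda features, label: label,  # map each element to its integer class id
    target_dist=[0.25, 0.25, 0.25, 0.25],      # desired class distribution
    initial_dist=[0.7, 0.1, 0.1, 0.1])         # observed distribution (optional; estimated online if omitted)

dataset = dataset.apply(resampler)
# the transformation yields (class_id, element) pairs, so drop the extra class id
dataset = dataset.map(lambda class_id, element: element)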


This QnA was very helpful to me, so I wrote up what I learned in a blog post:

https://vallum.imtqy.com/Optimizing_parallel_performance_of_resampling_with_tensorflow.html

I hope it gives some ideas to anyone looking for over-/undersampling with Tensorflow.

The idea is to replace the sequential flat_map-based resampling with a parallelized map, followed by a flat_map that only flattens the results:

dataset = dataset.map(undersample_filter_fn, num_parallel_calls=num_parallel_calls)
dataset = dataset.flat_map(lambda x: x)

flat_map with the identity lambda function is only there to merge the surviving (and empty) records.

# Pseudo-code to illustrate flat_map after the parallel map calls
# parallel calls of map('A'), map('B'), and map('C')
map('A') = 'AAAAA' # replication of A 5 times
map('B') = ''      # B is dropped
map('C') = 'CC'    # replication of C twice
# merging all map results
flat_map('AAAAA,,CC') = 'AAAAACC'
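A related alternative (not the blog's exact code) is Dataset.interleave with num_parallel_calls, which also lets each element expand to zero or more copies in parallel; here oversample_classes is the function from the first answer, and num_parallel_calls is assumed to be defined as above:

dataset = dataset.interleave(
    # each element becomes its own small dataset with a per-example repeat count
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x)),
    cycle_length=num_parallel_calls,
    num_parallel_calls=num_parallel_calls)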

Source: https://habr.com/ru/post/1689106/

