How is the number of parameters of the BatchNormalization layer equal to 2048?

I have the following code.

import keras

x = keras.layers.Input(batch_shape=(None, 4096))
hidden = keras.layers.Dense(512, activation='relu')(x)
hidden = keras.layers.BatchNormalization()(hidden)
hidden = keras.layers.Dropout(0.5)(hidden)
predictions = keras.layers.Dense(80, activation='sigmoid')(hidden)
mlp_model = keras.models.Model(inputs=[x], outputs=[predictions])  # 'input='/'output=' are the old Keras 1.x kwargs
mlp_model.summary()

And this is the model summary:

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_3 (InputLayer)             (None, 4096)          0
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 512)           2097664     input_3[0][0]
____________________________________________________________________________________________________
batchnormalization_1 (BatchNorma (None, 512)           2048        dense_1[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 512)           0           batchnormalization_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 80)            41040       dropout_1[0][0]
====================================================================================================
Total params: 2,140,752
Trainable params: 2,139,728
Non-trainable params: 1,024
____________________________________________________________________________________________________

The input size of the BatchNormalization (BN) layer is 512, and according to the Keras documentation the output shape of the BN layer is the same as its input shape, i.e. 512.

How, then, can the number of parameters of the BN layer be 2048?

2 answers

Keras's batch normalization implements the batch normalization paper (Ioffe and Szegedy, 2015, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift").

As you can read there, making batch normalization work during training requires keeping track of the distribution of each normalized dimension. To do this, since by default you are in mode=0 (the feature-wise mode of older Keras versions), the layer computes 4 parameters for each feature of the previous layer. These parameters ensure that the forward pass is normalized correctly and that information propagates correctly in the backward pass.

So 4 * 512 = 2048, which should answer your question.
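
To check this concretely, here is a minimal sketch (assuming a recent Keras; the variable names are illustrative, not from the question) that builds the same BN layer and counts its parameters directly:

import numpy as np
import keras

x = keras.layers.Input(shape=(4096,))
hidden = keras.layers.Dense(512, activation='relu')(x)
bn = keras.layers.BatchNormalization()
hidden = bn(hidden)

# The layer keeps four per-feature arrays, each with one entry per
# feature of the 512-wide input:
for w in bn.weights:
    print(w.name, w.shape)   # gamma, beta, moving_mean, moving_variance -> (512,)

print(sum(int(np.prod(w.shape)) for w in bn.weights))   # 4 * 512 = 2048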


These 2048 parameters are actually [gamma weights, beta weights, moving_mean (non-trainable), moving_variance (non-trainable)], each of which has 512 elements (the size of the layer's input). The two non-trainable arrays, moving_mean and moving_variance, are also what the summary reports as the 1,024 non-trainable parameters (2 * 512).
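
A quick way to verify this split (again a sketch, assuming a recent Keras) is to build a BN layer for a 512-feature input and compare its trainable and non-trainable weights:

import numpy as np
import keras

bn = keras.layers.BatchNormalization()
bn.build((None, 512))   # build the layer for a 512-feature input

# gamma and beta are updated by the optimizer; moving_mean and
# moving_variance are updated from batch statistics during training,
# so Keras reports them as non-trainable.
trainable = sum(int(np.prod(w.shape)) for w in bn.trainable_weights)
non_trainable = sum(int(np.prod(w.shape)) for w in bn.non_trainable_weights)
print(trainable, non_trainable)   # 1024 1024  (2 * 512 each)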

