Part 2: Resilient backpropagation in neural networks

This is a follow-up question to this post. For a given neuron, I don't understand how to take the partial derivative of its error with respect to a given weight.

Working from this web page, it clearly shows how backpropagation works (although I'm dealing with Resilient Propagation). For a feedforward neural network, we must: 1) move forward through the network, firing the neurons, 2) from the output-layer neurons, calculate the total error, 3) moving backward, propagate that error to each weight in each neuron, and 4) moving forward again, update the weights in each neuron.

Here is what I do not understand.

A) For each neuron, how do you calculate the partial derivative of the error with respect to the weight? My confusion is that in calculus a partial derivative is computed for a function of n variables. I sort of understand ldog and Bayer's answers to this post. I even understand the chain rule. But it doesn't gel when I think about exactly how to apply it to the output of i) a linear combiner and ii) a sigmoid activation function.

B) Using the Resilient Propagation approach, how would you update the bias in a given neuron? Or is there no bias or threshold in a NN trained with Resilient Propagation?

C) How do you propagate the total error when there are two or more output neurons? Is each neuron weight updated with the total error times each output neuron's value?

thanks

3 answers

Not 100% sure about the other points, but I can answer B:

B) The weights are updated based on the direction (sign) of the partial derivative, not its magnitude. The size of a weight update increases if the direction remains unchanged over successive iterations; oscillating directions shrink the update size. http://nopr.niscair.res.in/bitstream/123456789/8460/1/IJEMS%2012(5)%20434-442.pdf
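A minimal sketch of that rule (one common RProp variant, without weight-backtracking; the function name and the eta/step bounds are assumptions for illustration, with typical values eta+ = 1.2, eta- = 0.5):

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step, step_min=1e-6, step_max=50.0,
                 eta_plus=1.2, eta_minus=0.5):
    """One RProp-style update for an array of weights.

    Only the SIGN of each gradient component is used; `step` holds the
    individually adapted step size for each weight.
    """
    same_sign = grad * prev_grad
    # Grow the step where the gradient sign was stable, ...
    step = np.where(same_sign > 0, np.minimum(step * eta_plus, step_max), step)
    # ... shrink it where the sign flipped (oscillation).
    step = np.where(same_sign < 0, np.maximum(step * eta_minus, step_min), step)
    # Move each weight opposite to the sign of its gradient.
    w = w - np.sign(grad) * step
    return w, step
```

Note that the gradient's magnitude never enters the update, only its sign, which is exactly why RProp is robust to poorly scaled gradients.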


For me (also thinking in terms of calculus and symbolic equations), the thing with derivatives only clicked after I realized that the whole point is to express the derivative in terms of the function itself, and thus avoid the process of symbolic differentiation altogether.

A few examples (Python) may help...

If I have a linear activation function:

def f_act( x ): return x 

then the derivative is easy: wherever I need d(f_act), I substitute 1:

 def der_f_act( y ): return 1 

Similarly, if I have a logistic activation function:

f_act(x) = 1 / (1 + e^(-x))

then the derivative can be written in terms of the function itself (details here) as:

d(f_act) = f_act * (1 - f_act)

All of which can be coded as:

 import numpy

 def f_act( x ): return 1 / ( 1 + numpy.exp(-x) )

 def der_f_act( y ): return y * ( 1 - y )

For these examples, I already have the value of the activation function (from the forward pass), so I can take advantage of that and just evaluate the derivative at that point ;)
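As a quick sanity check of that trick (restating the definitions above so the snippet is self-contained):

```python
import numpy

def f_act(x):
    return 1 / (1 + numpy.exp(-x))

def der_f_act(y):
    # y is the already-cached activation value f_act(x) from the forward pass
    return y * (1 - y)

x = 0.0
y = f_act(x)         # forward pass caches y = 0.5
print(der_f_act(y))  # prints 0.25: sig(0) * (1 - sig(0)), no symbolic differentiation
```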

This is one reason to prefer certain activation functions: some have very convenient derivatives, which makes them easy and efficient to evaluate, especially when you are dealing with lots of nodes in a neural network.


A)

In supervised learning tasks, the overall optimization objective is the total loss over all training examples, defined as E = \sum_n loss(y_n, t_n), where n indexes the training examples, y_n is the network output for training example n, t_n is the label of training example n, and loss is the loss function. Note that y_n and t_n are generally vector quantities --- the vector length is determined by the number of output neurons in the network.

One possible choice for the loss function is the quadratic error, defined as loss(y, t) = \sum_k (y_k - t_k)^2, where k indexes the output neurons of the network. With backpropagation, one needs to compute the partial derivative of the overall optimization objective with respect to the network parameters --- the synaptic weights and neuron biases. This is achieved with the following formula, according to the chain rule:

(\partial E / \partial w_{ij}) = (\partial E / \partial out_j) * (\partial out_j / \partial in_j) * (\partial in_j / \partial w_{ij}),

where w_{ij} refers to the weight between neuron i and neuron j, out_j refers to the output of neuron j, and in_j refers to the input of neuron j.

How to calculate the neuron output out_j and its derivative with respect to the neuron input in_j depends on which activation function is used. If you use a linear activation function to compute the neuron output out_j, the term (\partial out_j / \partial in_j) becomes 1. If you use, for example, the logistic function as the activation function, the term (\partial out_j / \partial in_j) becomes sig(in_j) * (1 - sig(in_j)), where sig is the logistic function.
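Putting the three chain-rule factors together for a single logistic neuron j with squared-error loss (toy input values and names chosen purely for illustration):

```python
import math

def sig(x):
    return 1 / (1 + math.exp(-x))

# Toy neuron j: inputs from upstream neurons i, weights w_ij, label t.
outs_i = [0.5, -0.2]   # out_i: outputs of the upstream neurons
w = [0.3, 0.8]         # w_ij: weights into neuron j
t = 1.0                # label for this output neuron

in_j = sum(o * wi for o, wi in zip(outs_i, w))  # linear combiner
out_j = sig(in_j)                               # logistic activation

dE_dout = 2 * (out_j - t)       # dE / d out_j for loss (out_j - t)^2
dout_din = out_j * (1 - out_j)  # d out_j / d in_j (logistic derivative)
# d in_j / d w_ij = out_i, so multiply the three factors per weight:
grads = [dE_dout * dout_din * o for o in outs_i]
```

Each entry of `grads` is (\partial E / \partial w_{ij}) for one incoming weight; you can verify it against a finite-difference estimate of the loss.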

B)

In resilient backpropagation, biases are updated in the same way as the weights --- based on the sign of the partial derivative and individually adjustable step sizes.

C)

I'm not quite sure whether I understand the question correctly. The overall optimization objective is a scalar function of all the network parameters, no matter how many output neurons there are. So there should be no confusion about how to compute partial derivatives here.

In general, to compute the partial derivative (\partial E / \partial w_{ij}) of the overall optimization objective E with respect to some weight w_{ij}, one needs to compute the partial derivative (\partial out_k / \partial w_{ij}) of each output neuron k with respect to w_{ij}, as

(\partial E / \partial w_{ij}) = \sum_k (\partial E / \partial out_k) * (\partial out_k / \partial w_{ij}).

Note, however, that the partial derivative (\partial out_k / \partial w_{ij}) of output neuron k with respect to w_{ij} will be zero if w_{ij} does not affect the output out_k of output neuron k.

One more thing. When the quadratic error is used as the loss function, the partial derivative (\partial E / \partial out_k) of the overall optimization objective E with respect to the output out_k of some output neuron k is

(\partial E / \partial out_k) = 2 * (out_k - t_k),

where the quantity (out_k - t_k) is called the error attributed to output unit k, and where I assumed just one training example with label t for ease of notation. Note that if w_{ij} does not affect the output out_k of output neuron k, then the update of w_{ij} will not depend on the error (out_k - t_k), because (\partial out_k / \partial w_{ij}) = 0, as mentioned above.
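A tiny sketch of the multi-output case (assumed toy network: one hidden neuron feeding two output neurons, so a weight into output 0 leaves output 1 untouched and the sum over k collapses to one term):

```python
import math

def sig(x):
    return 1 / (1 + math.exp(-x))

h = 0.6           # output of a single hidden neuron
w = [0.4, -0.7]   # w[k]: weight from the hidden neuron to output neuron k
t = [1.0, 0.0]    # labels for the two output neurons

outs = [sig(h * wk) for wk in w]
E = sum((ok - tk) ** 2 for ok, tk in zip(outs, t))  # total quadratic error

# Gradient of E w.r.t. w[0]: only output 0 depends on w[0], so the
# sum over k reduces to the k = 0 term (the k = 1 term is zero).
dE_dw0 = 2 * (outs[0] - t[0]) * outs[0] * (1 - outs[0]) * h
```

Even though E sums the errors of both output neurons, the update for w[0] only carries the error of the output neuron it actually feeds.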

One final note to avoid confusion: y_k and out_k both refer to the output of output neuron k in the network.


Source: https://habr.com/ru/post/896863/
