A)
In supervised learning tasks, the overall optimization objective is the total loss over all training examples, defined as E = \sum_n loss(y_n, t_n), where n indexes the training examples, y_n is the network output for training example n, t_n is the label of training example n, and loss is the loss function. Note that y_n and t_n are generally vector-valued quantities --- the length of the vector is determined by the number of output neurons in the network.
One possible choice of loss function is the quadratic error, defined as loss(y, t) = \sum_k (y_k - t_k)^2, where k indexes the output neurons of the network. For backpropagation, one needs the partial derivative of the overall optimization objective with respect to the network parameters --- the synaptic weights and neuron biases. By the chain rule, this is computed using the following formula:
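As a minimal sketch of the definitions above (the function names are illustrative, not from the original), the quadratic error for one example and the total objective E summed over examples can be written as:

```python
import numpy as np

# Quadratic error for one example: loss(y, t) = sum_k (y_k - t_k)^2,
# where y is the network output vector and t the label vector.
def quadratic_loss(y, t):
    return np.sum((y - t) ** 2)

# Overall objective E: the loss summed over all training examples n.
def total_loss(ys, ts):
    return sum(quadratic_loss(y, t) for y, t in zip(ys, ts))

y = np.array([0.8, 0.2])
t = np.array([1.0, 0.0])
print(quadratic_loss(y, t))  # (0.8-1)^2 + (0.2-0)^2 = 0.08
```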
(\partial E / \partial w_{ij}) = (\partial E / \partial out_j) * (\partial out_j / \partial in_j) * (\partial in_j / \partial w_{ij}),
where w_{ij} is the weight between neuron i and neuron j, out_j is the output of neuron j, and in_j is the input to neuron j.
How the neuron output out_j and its derivative with respect to the neuron input in_j are computed depends on which activation function is used. With a linear activation function for the neuron output out_j, the term (\partial out_j / \partial in_j) becomes 1. If, for example, the logistic function is used as the activation function, the term (\partial out_j / \partial in_j) becomes sig(in_j) * (1 - sig(in_j)), where sig is the logistic function.
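The two activation derivatives just mentioned can be sketched and sanity-checked numerically; the function names here are illustrative:

```python
import numpy as np

# Logistic (sigmoid) activation.
def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

# Its derivative: d out_j / d in_j = sig(in_j) * (1 - sig(in_j)).
def sig_deriv(x):
    s = sig(x)
    return s * (1.0 - s)

# For a linear activation out_j = in_j, the derivative is simply 1.
# Sanity check of sig_deriv against a central finite difference:
x, eps = 0.5, 1e-6
numeric = (sig(x + eps) - sig(x - eps)) / (2 * eps)
assert abs(sig_deriv(x) - numeric) < 1e-8
```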
B)
In resilient backpropagation (Rprop), the biases are updated in the same way as the weights --- based on the sign of the partial derivatives and individually adaptable step sizes.
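A minimal sketch of such a sign-based update follows; it treats weights and biases identically, and the factors eta_plus = 1.2, eta_minus = 0.5 and the step bounds are the commonly cited Rprop defaults rather than values from this thread:

```python
import numpy as np

# One Rprop-style step for a parameter vector (weights or biases alike).
# Each parameter carries its own step size, adapted from the sign of the
# current and previous gradient.
def rprop_step(params, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    same_sign = grad * prev_grad > 0
    flipped = grad * prev_grad < 0
    # Same sign: grow the step; sign flip: shrink it (within bounds).
    step = np.where(same_sign, np.minimum(step * eta_plus, step_max), step)
    step = np.where(flipped, np.maximum(step * eta_minus, step_min), step)
    # On a sign flip the gradient is zeroed so no move is made this step.
    grad = np.where(flipped, 0.0, grad)
    params = params - np.sign(grad) * step
    return params, grad, step
```

Only the sign of the gradient enters the parameter update; its magnitude affects nothing but the sign comparison.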
C)
I'm not quite sure I understand the question correctly. The overall optimization objective is a scalar function of all network parameters, regardless of the number of output neurons. So there should be no confusion about how to compute partial derivatives here.
In general, to compute the partial derivative (\partial E / \partial w_{ij}) of the overall optimization objective E with respect to some weight w_{ij}, one needs the partial derivative (\partial out_k / \partial w_{ij}) of each output neuron k with respect to w_{ij}:
(\partial E / \partial w_{ij}) = \sum_k (\partial E / \partial out_k) * (\partial out_k / \partial w_{ij}).
Note, however, that the partial derivative (\partial out_k / \partial w_{ij}) of output neuron k with respect to w_{ij} will be zero if w_{ij} does not affect the output out_k of output neuron k.
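As a small illustration of that sum over output neurons (all concrete values here are made up for the example): suppose w_{ij} feeds only output neuron 0 of a two-output linear network, so the second term of the sum vanishes.

```python
import numpy as np

x = 1.5                            # input carried by the connection w_ij
dE_dout = np.array([0.6, -0.4])    # \partial E / \partial out_k
dout_dw = np.array([x, 0.0])       # \partial out_k / \partial w_ij;
                                   # zero for k=1, which w_ij cannot reach
dE_dw = np.sum(dE_dout * dout_dw)  # chain rule summed over outputs k
print(dE_dw)                       # only k=0 contributes: 0.6 * 1.5 = 0.9
```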
One more thing. When the quadratic error is used as the loss function, the partial derivative (\partial E / \partial out_k) of the overall optimization objective E with respect to the output out_k of some output neuron k is
(\partial E / \partial out_k) = 2 * (out_k - t_k),
where the quantity (out_k - t_k) is called the error associated with output unit k, and where I assumed a single training example with label t for notational convenience. Note that if w_{ij} does not affect the output out_k of output neuron k, then the update of w_{ij} will not depend on the error (out_k - t_k), because (\partial out_k / \partial w_{ij}) = 0, as mentioned above.
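That derivative of the quadratic error can be verified numerically; the concrete output and label values below are purely illustrative:

```python
import numpy as np

t = np.array([1.0, 0.0])       # label of the single training example
out = np.array([0.8, 0.3])     # network outputs out_k

def E(o):
    # Quadratic error for one example.
    return np.sum((o - t) ** 2)

analytic = 2 * (out - t)       # claimed (\partial E / \partial out_k)
eps = 1e-6
for k in range(len(out)):
    o = out.copy()
    o[k] += eps
    numeric = (E(o) - E(out)) / eps   # forward finite difference
    assert abs(numeric - analytic[k]) < 1e-4
```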
One final note to avoid confusion: y_k and out_k both denote the output of output neuron k in the network.