Trying to understand the AdaDelta algorithm

I'm trying to add AdaDelta to my simple feed-forward neural network, but I think I'm having trouble understanding this article: http://arxiv.org/pdf/1212.5701v1.pdf
It is a short paper introducing the AdaDelta algorithm; only about a page and a half of it is devoted to the formulas.

Starting from the part:

Algorithm 1: Computing ADADELTA update at time t


Question 1 part: '3: Compute Gradient: g_t'

How exactly do I calculate the gradient here? Is my code below correct?

/* Gradient for a neuron inside a hidden layer:
   sum over the outgoing connections of (target neuron's gradient * connection weight),
   multiplied by the derivative of the activation function */
double CalculatHiddenGradient() {
    double sum = 0.0;
    for (size_t i = 0; i < OutcomingConnections.size(); i++) {
        sum += OutcomingConnections[i]->weight * OutcomingConnections[i]->target->gradient;
    }
    return (1.0 - output * output) * sum; // derivative of tanh
}

// Gradient for an output neuron, where the desired output value is known
double CalculatGradient(double TargetOutput) {
    return (TargetOutput - output) * (1.0 - output * output);
}

Question 2 part: '5: Compute Update: Δx_t'

Formula (14) says the following:

Δx_t = -( RMS[Δx]_{t-1} / RMS[g]_t ) * g_t

Is RMS[Δx]_{t-1} computed the same way as in (9), i.e. like this?

RMS[Δx]_{t-1} = sqrt( E[Δx²]_{t-1} + e )


And this is how I have tried to put the whole algorithm into code:

#include <cmath>
#include <vector>
using std::vector;

class AdaDelta {
private:
    vector<double> Eg; // E[g²]
    vector<double> Ex; // E[∆x²]
    vector<double> g;  // gradient
    size_t windowsize;
    double p; // Decay rate ρ
    double e; // Constant e, epsilon?

public:
    AdaDelta(size_t WindowSize = 32, double DecayRate = 0.95, double ConstantE = 0.001) { // initializing variables
        Eg.reserve(WindowSize + 1);
        Ex.reserve(WindowSize + 1);

        Eg.push_back(0.0); // E[g²]t
        Ex.push_back(0.0); // E[∆x²]t
        g.push_back(0.0); // (gradient)t

        windowsize = WindowSize; // common value:?

        p = DecayRate; // common value:0.95
        e = ConstantE; // common value:0.001
    }

    // Does it return the weight update value?
    double CalculateUpdated(double gradient) {
        double dx; // ∆x_t
        size_t t;

        // for t = 1 : T do %% Loop over # of updates
        for (t = 1; t < Eg.size(); t++) {

            // Accumulate Gradient
            Eg[t] = (p * Eg[t - 1] + (1.0 - p) * (g[t] * g[t]));

            // Compute Update
            dx = -(sqrt(Ex[t - 1] + e) / sqrt(Eg[t] + e)) * g[t];

            // Accumulate Updates
            Ex[t] = p * Ex[t - 1] + (1.0 - p) * (dx * dx);
        }

        /* calculate new update
        =================================== */
        t = g.size();
        g.push_back(gradient);

        // Accumulate Gradient
        Eg.push_back((p * Eg[t - 1] + (1.0 - p) * (g[t] * g[t])));

        // Compute Update
        dx = -(sqrt(Ex[t - 1] + e) / sqrt(Eg[t] + e)) * g[t];

        // Accumulate Updates
        Ex.push_back(p * Ex[t - 1] + (1.0 - p) * (dx * dx));

        // Drop the oldest history entry once the window has grown bigger than we allow
        if (g.size() >= windowsize) {
            Eg[1] = 0.0;
            Ex[1] = 0.0;
            Eg.erase(Eg.begin());
            Ex.erase(Ex.begin());
            g.erase(g.begin());

        }
        return dx;
    }
};

Question 3 part:

I calculate the gradient values with backpropagation, as in the code above. Then, for each weight, I call adadelta.CalculateUpdated() with that weight's gradient. Is the value it returns the amount by which I should update the weight?


Question 4 part:

Reading further in the article, section 3.2, Idea 2, there is formula (13):

Δx = (∂f/∂x) / (∂²f/∂x²),   i.e.   1/(∂²f/∂x²) = Δx / (∂f/∂x)

I don't really understand this step. What is it saying?

Question 5 part:

What exactly do Δx, ∂f, and ∂x stand for in formula (13)?


Thanks in advance!


Before answering the individual questions, a little context about what AdaDelta (and any optimizer of this kind) is doing. In machine learning we have a model (a neural network, for example) described by a set of numeric parameters, and training means fitting those parameters to data.

Suppose we are given a training set

D = {(A_1,b_1), (A_2,b_2), (A_3,b_3), ...}

where A_k is the k-th input and b_k is the desired output (label) for it. The model itself is described by a set of numeric parameters (weights, biases, and so on)

x = (x_1, x_2, ..., x_n)

For every pair (A_k, b_k) the model takes the input A_k and, using the current parameters x, produces some prediction of b_k. Training means choosing the parameters x so that the predictions over the whole of D are as good as possible.

"" --- , , , (RMS) .

Gradient descent is the simplest way to achieve that: at every step the parameters x are nudged in the direction that reduces this error:

x_new <- x_old - learning_rate * gradient(RMS[predicted - actual])
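
To make that update rule concrete, here is a minimal, self-contained sketch of plain gradient descent on a single parameter; the toy loss f(x) = (x - 3)^2 and the fixed learning rate of 0.1 are made-up values, chosen only for illustration:

#include <cstdio>

// Toy loss: f(x) = (x - 3)^2, which has its minimum at x = 3.
// Its derivative df/dx = 2 * (x - 3) plays the role of the "gradient".
double f_prime(double x) { return 2.0 * (x - 3.0); }

int main() {
    double x = 0.0;             // initial parameter value
    double learning_rate = 0.1; // fixed step size (this is what AdaDelta will replace)

    for (int step = 0; step < 50; step++) {
        double gradient = f_prime(x);     // gradient at the current x
        x = x - learning_rate * gradient; // x_new <- x_old - learning_rate * gradient
    }
    printf("x after 50 steps: %f\n", x); // prints a value very close to 3
    return 0;
}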

Methods such as AdaGrad and AdaDelta keep this basic scheme but make the "learning rate" adaptive: instead of one fixed constant, the step size is computed per parameter and changes over time, so in AdaDelta every component of x effectively gets its own "learning rate" derived from the history of its gradients and updates.

Now, to the questions, point by point:

  • 1:

The gradient is the vector of partial derivatives of the error with respect to the parameters (i.e. with respect to the components of x), evaluated at the current parameter values:

g_t = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n)(x_t)

Here f(x_1, x_2, ..., x_n) is the error function viewed as a function of the parameters; the whole vector is evaluated at the current point, which is what the subscript denotes: x_t. For a neural network, backpropagation is exactly the procedure that computes these partial derivatives.

  • 2:

Yes, the RMS of the previous updates Δx is

RMS[\Delta x]_{t-1} = \sqrt{ E[\Delta x^2]_{t-1} + \epsilon },

where the running average is accumulated as

E[\Delta x^2]_t = \rho E[\Delta x^2]_{t-1} + (1-\rho) \Delta x^2_t,

E[\Delta x^2]_0 = 0.

(The running average of the squared gradients, E[g^2]_t, is accumulated in the same way; both appear in the sketch after this list.)
  • 3:

AdaDelta is not backpropagation; backpropagation gives you the gradients, and AdaDelta is the rule that turns those gradients into weight updates. The rule is:

(new_weights @ T) := (old_weights @ T-1) - [adaptive_learning_rate] * (gradient @ T)

adaptive_learning_rate := (RMS[Delta-x] @ T-1) / (RMS[gradient] @ T)

So yes, the value CalculateUpdated() returns is the (already signed) amount Δx_t that you add to the weight. AdaDelta differs from plain gradient descent only in that the learning rate is adaptive instead of being a fixed constant; a complete single-parameter version of the step is sketched after this list.

  • 4:

There is nothing mysterious about the "gradient" here: a derivative is simply a ratio of changes, i.e. how much one quantity changes per change in another (rise/run, change in output / change in input, Δf/Δx, and so on). Formula (13) just rearranges Newton's step so that the inverse of the second derivative is expressed through such a ratio, Δx/(∂f/∂x), which is what gives the resulting update the same units as x itself.

  • 5:

Delta-x is the change in x from one step to the next. That is, if x_i is the current value of a parameter and x_{i+1} is its value after the update, then Delta-x is (x_{i+1} - x_i).

(∂f/∂x) is the partial derivative of f with respect to x (in the ML setting, f is the error/loss function).
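
To tie points 1-3 (and your questions 3 and 5) together, here is a minimal single-parameter sketch of the update described in Algorithm 1 of the paper. The AdaDeltaParam struct, the toy loss f(x) = (x - 3)^2 and the concrete values rho = 0.95 and eps = 0.001 (the same ones used in the question's code) are only illustrative assumptions:

#include <cmath>
#include <cstdio>

// One parameter x trained with the AdaDelta update:
//   E[g^2]_t  = rho * E[g^2]_{t-1}  + (1 - rho) * g_t^2
//   dx_t      = -( sqrt(E[dx^2]_{t-1} + eps) / sqrt(E[g^2]_t + eps) ) * g_t
//   E[dx^2]_t = rho * E[dx^2]_{t-1} + (1 - rho) * dx_t^2
//   x_{t+1}   = x_t + dx_t
struct AdaDeltaParam {
    double x;          // the parameter being trained
    double Eg2 = 0.0;  // running average E[g^2]
    double Edx2 = 0.0; // running average E[dx^2]
    double rho, eps;

    AdaDeltaParam(double x0, double rho_ = 0.95, double eps_ = 0.001)
        : x(x0), rho(rho_), eps(eps_) {}

    // Apply one update, given the gradient of the loss w.r.t. this parameter.
    void update(double g) {
        Eg2 = rho * Eg2 + (1.0 - rho) * g * g;                           // accumulate gradient
        double dx = -(std::sqrt(Edx2 + eps) / std::sqrt(Eg2 + eps)) * g; // compute update
        Edx2 = rho * Edx2 + (1.0 - rho) * dx * dx;                       // accumulate updates
        x += dx;                                                         // apply update
    }
};

int main() {
    AdaDeltaParam p(0.0); // start far from the minimum
    for (int t = 1; t <= 2000; t++) {
        double g = 2.0 * (p.x - 3.0); // gradient of the toy loss f(x) = (x - 3)^2
        p.update(g);
        if (t % 500 == 0)
            printf("step %4d: x = %f\n", t, p.x);
    }
    // x should now hover near the minimum at 3; note that there is no hand-tuned
    // learning rate anywhere, only the decay rate rho and the constant eps.
    return 0;
}

In a real network you would keep one such pair of accumulators (Eg2, Edx2) per weight and call update() with the gradient that backpropagation produced for that weight; there is no global learning rate to tune.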

