Gradient Descent Implementation

I implemented both batch and stochastic gradient descent, but I am running into some problems. This is my stochastic update rule:

    for i = 1 to m {
        theta(j) := theta(j) - step * derivative   (for all j)
    }
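
To make the rule concrete, here is a minimal sketch of one stochastic pass, assuming a linear-regression hypothesis where the per-example derivative is (h(x_i) - y_i) * x_i; the model, data shapes, and function names here are my assumptions, not something stated in the question:

    import numpy as np

    def sgd_epoch(theta, X, y, step):
        # One stochastic pass: update theta once per training example.
        # X: (m, n) design matrix, y: (m,) targets, theta: (n,) parameters.
        for i in range(len(y)):
            error = X[i] @ theta - y[i]          # h(x_i) - y_i, assuming a linear model
            theta = theta - step * error * X[i]  # updates all j at once
        return theta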

The problem is that while the cost function keeps getting smaller and smaller, testing shows the results are not very good. If I change the step size a little and change the number of iterations, the cost function ends up slightly higher, but the test results are fine. Is this a symptom of overfitting? How do I know which version is correct? :)

As I said, even though the cost function is minimized further, testing says the results are not very good.

2 answers

Gradient descent is a local search method for minimizing a function. When it reaches a local minimum in the parameter space, it cannot go any further. This makes gradient descent (and other local methods) prone to getting stuck in local minima rather than reaching the global minimum. Local minima may or may not be good solutions for what you are trying to achieve, and what to expect depends on the function you are trying to minimize.

In particular, high-dimensional NP-complete problems can be tricky. They often have exponentially many local optima, many of which are nearly as good as the global optimum in terms of cost, but with parameter values orthogonal to those of the global optimum. These are hard problems: you generally cannot expect to find the global optimum; instead, you look for a local minimum that is good enough. They are also relevant problems: many interesting problems have exactly these properties.

I would suggest first testing your gradient descent implementation on an easy problem. You might try finding the minimum of a polynomial. Since it is a one-parameter problem, you can plot the progress of the parameter values along the polynomial's curve. You should be able to spot if something is badly wrong, and you can also watch the search get stuck in local minima. You should also see that the choice of starting point can make a big difference.
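
For instance, here is a minimal sketch of such a test, assuming a quartic f(x) = x^4 - 3x^2 + x that happens to have two minima; the polynomial and the parameter values are my choice, not from the answer:

    # Hypothetical test polynomial with two minima (near x = -1.30 and x = 1.13).
    def f(x):
        return x**4 - 3*x**2 + x

    def df(x):
        return 4*x**3 - 6*x + 1   # analytic derivative of f

    def descend(x0, step=0.01, iters=1000):
        x = x0
        for _ in range(iters):
            x -= step * df(x)
        return x

    # Different starting points get stuck in different minima:
    for x0 in (-2.0, 2.0):
        x = descend(x0)
        print(f"start={x0:+.1f}  ->  x={x:.4f}  f(x)={f(x):.4f}")

Starting at x0 = 2.0 converges to the shallower minimum while x0 = -2.0 reaches the deeper one, which makes the effect of initialization easy to see.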

For more difficult problems, you can modify your algorithm to help it escape local minima. A few common approaches:

  • Add some noise. This reduces the precision of the parameters you find, which can "blur out" local minima. The search can then jump out of local minima that are small relative to the noise, while still settling into deeper minima. A well-known approach to adding noise is simulated annealing.

  • Add momentum. Along with using the current gradient to define the step, also keep moving in the same direction as the previous step. If you take a fraction of the previous step as the momentum term, there is a tendency to keep going, which can carry the search past a local minimum. By using a fraction, the steps decay exponentially, so a bad step is not a big problem. This has always been a popular modification of gradient descent when it is used to train neural networks, where gradient descent is known as backpropagation. (See the sketch after this list.)

  • Use a hybrid search. First use a global search (for example, genetic algorithms or various Monte Carlo methods) to find good starting points, then apply gradient descent to exploit the gradient information in the function.
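
As a rough illustration of the momentum idea, here is a minimal sketch under the same assumed one-parameter setup as above; the decay factor beta and the other values are illustrative assumptions:

    def descend_momentum(df, x0, step=0.01, beta=0.9, iters=1000):
        # Keep a decaying fraction (beta) of the previous step as momentum.
        x, v = x0, 0.0
        for _ in range(iters):
            v = beta * v - step * df(x)   # old step decays, new gradient accumulates
            x += v
        return x

    # Usage with the hypothetical df from the earlier sketch:
    # descend_momentum(df, x0=2.0)

The accumulated velocity can carry the search through shallow dips that plain gradient descent would get stuck in, at the cost of some overshooting near the minimum.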

I will not recommend one over the others. Instead, I suggest doing a little research to see what others have done for problems related to the one you are working on. If this is purely a learning exercise, momentum is probably the easiest to get working.

There are many things that could be going wrong:

  • Your step size may be a poor choice.
  • Your derivative may be wrong.
  • Your "expected value" may be mistaken.
  • Your gradient descent may simply be slow to converge.

I would try increasing the run length, and doing runs with various step sizes. A smaller step will have a better chance of avoiding problems caused by a step that is too large.
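
A quick sweep along these lines might look like the following sketch; the toy objective (x - 3)^2 and the particular step sizes and run lengths are my assumptions, chosen only to show the range of behaviours:

    # Toy one-parameter objective: f(x) = (x - 3)**2, derivative 2*(x - 3).
    def run(step, iters, x0=0.0):
        x = x0
        for _ in range(iters):
            x -= step * 2 * (x - 3)
        return x

    for step in (1.1, 0.5, 0.1, 0.01):
        for iters in (10, 100, 1000):
            print(f"step={step:<4}  iters={iters:<5}  x={run(step, iters):.4f}")

Here step=1.1 diverges, the mid-range steps converge quickly, and 0.01 converges only with a long enough run: exactly the trade-off between step size and run length described above.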

Source: https://habr.com/ru/post/1395059/

