Why do we use gradient descent in linear regression?

In some machine learning classes I took recently, we used gradient descent to find the line of best fit for linear regression.

In some statistics classes, I learned that we can calculate this line analytically, using the mean and standard deviation - this page describes the approach in detail. Why is this seemingly simpler method not used in machine learning?
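To show what I mean, here is a rough sketch of that approach on some made-up numbers (the data are purely illustrative):

    import numpy as np

    # made-up sample data, just for illustration
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # closed-form simple linear regression from the correlation coefficient,
    # the standard deviations, and the means
    r = np.corrcoef(x, y)[0, 1]
    slope = r * y.std() / x.std()
    intercept = y.mean() - slope * x.mean()
    print(slope, intercept)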

My question is: is gradient descent the preferred method for fitting linear models? If so, why? Or did the professor simply use gradient descent in a simple setting to introduce the class to the technique?

+6
2 answers

The example you gave is one-dimensional, which is usually not the case in machine learning, where you have several input features. In that case, the simple closed-form approach requires inverting a matrix, which can be expensive or ill-conditioned.
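To make that concrete, here is a rough numpy sketch of the closed-form (normal equation) solution for the multi-feature case; the design matrix is random and purely illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))   # 100 samples, 5 input features
    y = rng.normal(size=100)

    # normal equation: solve (X^T X) w = X^T y rather than explicitly
    # inverting X^T X, which can be expensive or ill-conditioned
    w = np.linalg.solve(X.T @ X, X.T @ y)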

Usually the problem is formulated as a least-squares problem, which is a bit simpler. There are standard least-squares solvers that can (and often should) be used instead of gradient descent. If the number of data points is very large, using a standard least-squares solver can be too expensive, and (stochastic) gradient descent can give you a solution that is just as good in terms of test error as the more exact solution, with a run time that is several orders of magnitude smaller (see this wonderful chapter by Léon Bottou).
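As a rough illustration of the trade-off, here is a sketch comparing numpy's off-the-shelf least-squares solver with a hand-rolled stochastic gradient descent loop; the data, step size, and number of passes are arbitrary choices, not recommendations:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10_000, 20
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = X @ true_w + 0.1 * rng.normal(size=n)

    # off-the-shelf least-squares solver
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    # plain stochastic gradient descent on the squared error
    w_sgd = np.zeros(d)
    lr = 0.01
    for epoch in range(5):
        for i in rng.permutation(n):
            err = X[i] @ w_sgd - y[i]
            w_sgd -= lr * err * X[i]   # gradient of (x_i . w - y_i)^2 / 2

    # the two solutions are typically close in this small example
    print(np.linalg.norm(w_lstsq - w_sgd))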

If your problem is small enough to be solved efficiently by an off-the-shelf least-squares solver, you probably should not do gradient descent.

+11

Basically, the gradient descent algorithm is a general optimization technique that can be used to minimize any (differentiable) cost function. It is often used when the optimal point cannot be computed in closed form.

So, let's say we want to minimize a cost function. What happens in gradient descent is that we start at some random starting point and take steps in the direction of the negative gradient in order to reduce the cost. We keep stepping until there is no further reduction in cost; at that point we are at a minimum. To make it easier to understand, imagine a ball and a bowl. If we drop the ball from some starting point on the bowl, it will roll until it settles at the bottom of the bowl.
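Here is a minimal sketch of that loop on a one-dimensional "bowl", the cost f(x) = x^2 with gradient 2x; the starting point, step size, and stopping tolerance are arbitrary:

    # gradient descent on f(x) = x**2, whose gradient is 2*x
    x = 5.0          # some random starting point on the bowl
    step = 0.1
    for _ in range(1000):
        grad = 2 * x
        new_x = x - step * grad      # move against the gradient
        if abs(x - new_x) < 1e-8:    # stop once there is no more progress
            break
        x = new_x
    print(x)         # close to 0, the bottom of the bowl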

Since gradient descent is a general algorithm, it can be applied to any problem that requires minimizing a cost function. Regression problems typically use the mean squared error (MSE) as the cost function. Finding the closed-form solution requires inverting a matrix, which in many cases is ill-conditioned (its determinant is very close to zero and therefore does not give a reliable inverse). To get around this, people often use gradient descent, which finds a solution without suffering from this ill-conditioning problem.
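For example, here is a rough sketch of full-batch gradient descent on the MSE cost, which only needs matrix-vector products and never an explicit inverse (the data and learning rate are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

    n = len(y)
    w = np.zeros(X.shape[1])
    lr = 0.05
    for _ in range(2000):
        grad = 2.0 / n * X.T @ (X @ w - y)   # gradient of the MSE cost
        w -= lr * grad
    print(w)   # should end up close to [1.0, -2.0, 0.5]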

+1

Source: https://habr.com/ru/post/977883/

