I am working on a machine learning problem and want to use linear regression as the learning algorithm. I applied two different methods to find the theta parameters of a linear regression model: gradient descent and the normal equation. Given the same data, they should produce approximately equal theta vectors. However, they do not.
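For reference, here is a minimal NumPy sketch (an illustration on synthetic data, not the code in question) of what I expected: on well-conditioned data the two methods should agree to many digits.

```python
import numpy as np

# Synthetic regression problem (assumed for illustration only).
rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.01 * rng.normal(size=m)

Xb = np.hstack([np.ones((m, 1)), X])  # prepend intercept column of ones

# Normal equation: theta = pinv(X'X) X'y
theta_ne = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y

# Plain batch gradient descent on J(theta) = ||X theta - y||^2 / (2m)
theta_gd = np.zeros(n + 1)
alpha = 0.1
for _ in range(20000):
    theta_gd -= alpha * (Xb.T @ (Xb @ theta_gd - y)) / m

# The two solutions should coincide to near machine precision here.
print(np.max(np.abs(theta_ne - theta_gd)))
```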
Both theta vectors are very similar on all elements but the first one, which is the one multiplied by the column of ones prepended to the data.
This is what theta looks like (the first column is the output of gradient descent, the second is the output of the normal equation):

```
Grad desc    Norm eq
-237.7752    -4.6736
  -5.8471    -5.8467
   9.9174     9.9178
   2.1135     2.1134
  -1.5001    -1.5003
 -37.8558   -37.8505
  -1.1024    -1.1116
 -19.2969   -19.2956
  66.6423    66.6447
 297.3666   296.7604
-741.9281  -744.1541
 296.4649   296.3494
 146.0304   144.4158
  -2.9978    -2.9976
  -0.8190    -0.8189
```
What can cause the difference between theta(1, 1) returned by gradient descent and theta(1, 1) returned by the normal equation? Do I have a bug in my code?
Here is my implementation of the normal equation in Matlab:
```matlab
function theta = normalEque(X, y)
    [m, n] = size(X);
    X = [ones(m, 1), X];
    theta = pinv(X'*X)*X'*y;
end
```
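The same normal-equation step can be sketched in NumPy (my translation for illustration, not the Matlab above) and checked against a standard least-squares solver:

```python
import numpy as np

def normal_eq(X, y):
    """Solve least squares via pinv(X'X) X'y after prepending a ones column."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y

# Synthetic full-rank data (assumed for the check).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)

theta = normal_eq(X, y)
# On full-rank data this should match NumPy's least-squares solver.
ref, *_ = np.linalg.lstsq(np.hstack([np.ones((50, 1)), X]), y, rcond=None)
print(np.allclose(theta, ref))  # -> True
```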
Here is the code for gradient descent:
```matlab
function theta = gradientDesc(X, y)
    options = optimset('GradObj', 'on', 'MaxIter', 9999);
    [theta, ~, ~] = fminunc(@(t)(cost(t, X, y)),...
                            zeros(size(X, 2)+1, 1), options);  % n+1 parameters (intercept added inside cost)
end

function [J, grad] = cost(theta, X, y)
    m = size(X, 1);
    X = [ones(m, 1), X];
    J = sum((X * theta - y) .^ 2) ./ (2*m);
    for i = 1:size(theta, 1)
        grad(i, 1) = sum((X * theta - y) .* X(:, i)) ./ m;
    end
end
```
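One way to sanity-check this cost/gradient pair is to compare the analytic gradient against a finite-difference approximation. A NumPy rendering of the same cost (my translation, on synthetic data) passes that check:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = ||X theta - y||^2 / (2m) with a ones column prepended, plus its gradient."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])
    r = Xb @ theta - y
    J = (r @ r) / (2 * m)
    grad = Xb.T @ r / m
    return J, grad

# Synthetic data and a random evaluation point (assumed for the check).
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
theta = rng.normal(size=4)

J, grad = cost(theta, X, y)

# Central finite differences along each coordinate direction.
eps = 1e-6
num = np.array([(cost(theta + eps * e, X, y)[0] - cost(theta - eps * e, X, y)[0]) / (2 * eps)
                for e in np.eye(4)])
print(np.max(np.abs(grad - num)))  # tiny: the analytic gradient is correct
```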
I pass the same data X and y to both functions (I do not normalize X).
Edit 1:
Based on the answers and comments I checked some of my code and ran some tests.
First I wanted to check whether the problem could be caused by X being close to singular, as suggested by @user1489497's answer. So I replaced pinv with inv, and when I ran it I indeed got the warning Matrix is close to singular or badly scaled. To make sure this was not the problem, I obtained a much larger dataset and repeated the tests on it. This time inv(X) did not display the warning, and pinv and inv gave the same results. So I hope X is no longer close to singular.
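The singularity check itself can be made explicit by inspecting the condition number of the normal-equations matrix before trusting inv. A NumPy sketch (synthetic data, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 3))

# Append a near-duplicate of the first column: this makes X'X nearly singular,
# which is the situation that triggers Matlab's "close to singular" warning.
X_bad = np.hstack([A, A[:, :1] + 1e-10 * rng.normal(size=(100, 1))])

print(np.linalg.cond(X_bad.T @ X_bad))  # enormous -> inv() is unreliable, use pinv
print(np.linalg.cond(A.T @ A))          # modest -> inv() is fine
```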
Then I changed the normalEque code as suggested by woodchips, so now it looks like this:
```matlab
function theta = normalEque(X, y)
    X = [ones(size(X, 1), 1), X];
    theta = pinv(X)*y;
end
```
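For full-rank data, pinv(X)*y and pinv(X'*X)*X'*y give the same least-squares solution; the pinv(X) form is preferred numerically because forming X'X squares the condition number. A quick NumPy check of that equivalence (synthetic data, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.hstack([np.ones((60, 1)), rng.normal(size=(60, 5))])
y = rng.normal(size=60)

t1 = np.linalg.pinv(X) @ y                    # pseudoinverse applied to X directly
t2 = np.linalg.pinv(X.T @ X) @ X.T @ y        # normal-equations form
print(np.allclose(t1, t2))  # -> True on this well-conditioned problem
```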
However, the problem is still there. The new normalEque function, on the new data that is not close to singular, gives a different theta than gradientDesc.
To find out which algorithm is buggy, I ran the linear regression algorithm of the Weka data mining software on the same data. Weka computed a theta very similar to the output of normalEque but different from the output of gradientDesc. So I assume normalEque is correct and that there is a bug in gradientDesc.
Here is a comparison of the theta computed by Weka, normalEque and gradientDesc:

```
Weka(correct)   normalEque   gradientDesc
    779.8229      779.8163       302.7994
      1.6571        1.6571         1.7064
      1.8430        1.8431         2.3809
     -1.5945       -1.5945        -1.5964
      3.8190        3.8195         5.7486
     -4.8265       -4.8284       -11.1071
     -6.9000       -6.9006       -11.8924
    -15.6956      -15.6958       -13.5411
     43.5561       43.5571        31.5036
    -44.5380      -44.5386       -26.5137
      0.9935        0.9926         1.2153
     -3.1556       -3.1576        -1.8517
     -0.1927       -0.1919        -0.6583
      2.9207        2.9227         1.5632
      1.1713        1.1710         1.1622
      0.1091        0.1093         0.0084
      1.5768        1.5762         1.6318
     -1.3968       -1.3958        -2.1131
      0.6966        0.6963         0.5630
      0.1990        0.1990        -0.2521
      0.4624        0.4624         0.2921
    -12.6013      -12.6014       -12.2014
     -0.1328       -0.1328        -0.1359
```
I also computed the errors as suggested by Justin Peel. The output of normalEque gives a slightly smaller squared error, but the difference is marginal. What is more, when I compute the gradient of the cost at theta using the cost function (the same one used by gradientDesc), the theta returned by normalEque gives a gradient near zero, while the theta returned by gradientDesc does not. Here is what I mean:
```
>> [J_gd, grad_gd] = cost(theta_gd, X, y, size(X, 1));
>> [J_ne, grad_ne] = cost(theta_ne, X, y, size(X, 1));
>> disp([J_gd, J_ne])
  120.9932  119.1469
>> disp([grad_gd, grad_ne])
  -0.005172856743846  -0.000000000908598
  -0.026126463200876  -0.000000135414602
  -0.008365136595272  -0.000000140327001
  -0.094516503056041  -0.000000169627717
  -0.028805977931093  -0.000000045136985
  -0.004761477661464  -0.000000005065103
  -0.007389474786628  -0.000000005010731
   0.065544198835505  -0.000000046847073
   0.044205371015018  -0.000000046169012
   0.089237705611538  -0.000000046081288
  -0.042549228192766  -0.000000051458654
   0.016339232547159  -0.000000037654965
  -0.043200042729041  -0.000000051748545
   0.013669010209370  -0.000000037399261
  -0.036586854750176  -0.000000027931617
  -0.004761447097231  -0.000000027168798
   0.017311225027280  -0.000000039099380
   0.005650124339593  -0.000000037005759
   0.016225097484138  -0.000000039060168
  -0.009176443862037  -0.000000012831350
   0.055653840638386  -0.000000020855391
  -0.002834810081935  -0.000000006540702
   0.002794661393905  -0.000000032878097
```
This suggests that gradient descent simply did not converge to the global minimum... But that can hardly be the case, since I ran it for thousands of iterations. So where is the bug?