This is a little trickier; it depends on the algorithm you are using.
Let's take a simple but somewhat contrived example. Instead of optimizing the parameters of the function
y = a*x1 + b*x2
you could just as well optimize the parameters of
y = 1/a * x1 + 1/b * x2
Obviously, small parameters in the first parameterization correspond to large parameters in the second, so if you minimize them in the first case you would have to maximize them in the latter.
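As a quick illustration (a minimal Python sketch with made-up numbers, not part of the original argument), you can check that the two parameterizations describe exactly the same model, so "small parameters" only means something relative to how you write the model down:

```python
import numpy as np

# Hypothetical toy data: the same model written in two parameterizations.
# Small coefficients in the first form correspond to large coefficients
# in the second, so "keep the parameters small" is not parameterization-free.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

a, b = 0.1, 0.2                      # small parameters in  y = a*x1 + b*x2
y1 = a * X[:, 0] + b * X[:, 1]

a_inv, b_inv = 1 / a, 1 / b          # same model as  y = 1/a * x1 + 1/b * x2
y2 = (1 / a_inv) * X[:, 0] + (1 / b_inv) * X[:, 1]

print(np.allclose(y1, y2))           # True: identical predictions
print((a, b), "vs", (a_inv, b_inv))  # (0.1, 0.2) vs (10.0, 5.0)
```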
The fact that most algorithms minimize the square of the parameters comes from computational learning theory.
Suppose for what follows that you want to learn a function
f(x) = a + b*x + c*x^2 + d*x^3 + ...
One can argue that a function in which only a differs from zero is more probable than a function in which both a and b differ from zero, and so on. Following Occam's razor (if you have two hypotheses that explain your data, the simpler one is more likely the correct one), you should prefer a hypothesis in which more of your parameters are zero.
To give an example, suppose your data points are (x, y) = {(-1, 0), (1, 0)}. Which function would you prefer:
f(x) = 0
or
f(x) = -1 + 1*x^2
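To make the comparison concrete, here is a tiny (hypothetical) check that both candidates pass exactly through the two data points, even though the first one has no nonzero parameters at all:

```python
# Both candidates fit the two data points (-1, 0) and (1, 0) exactly,
# but f(x) = 0 uses no nonzero parameters while f(x) = -1 + x^2 uses two.
xs = [-1, 1]

f_simple = lambda x: 0.0
f_quad = lambda x: -1.0 + x ** 2

print([f_simple(x) for x in xs])   # [0.0, 0.0]
print([f_quad(x) for x in xs])     # [0.0, 0.0]
```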
Extending this a bit, you can go from preferring parameters that are exactly zero to preferring parameters with small values.
If you want to try this out, generate some data points from a linear function and add Gaussian noise. If you then look for the polynomial that fits these points perfectly, you will need a rather complex function, typically with large weights. However, if you apply regularization, you will come closer to your data-generating function.
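One possible way to run this experiment (a sketch assuming numpy and scikit-learn; the degree, noise level, and ridge strength are arbitrary choices on my part, not anything prescribed above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Data generated from a linear function plus Gaussian noise (hypothetical setup).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20).reshape(-1, 1)
y = 2.0 * x.ravel() + 0.5 + rng.normal(scale=0.2, size=x.shape[0])

degree = 9  # deliberately far too flexible for linear data

# Unregularized polynomial fit: tends to chase the noise with large weights.
plain = make_pipeline(PolynomialFeatures(degree), LinearRegression())
plain.fit(x, y)

# Ridge-regularized fit: penalizing squared weights keeps them small,
# and the fitted curve stays closer to the underlying linear function.
regularized = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0))
regularized.fit(x, y)

print("max |weight| without regularization:",
      np.abs(plain.named_steps["linearregression"].coef_).max())
print("max |weight| with ridge:",
      np.abs(regularized.named_steps["ridge"].coef_).max())
```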
But if you want to ground your reasoning in solid theoretical foundations, I would recommend looking at Bayesian statistics. The idea there is to determine a probability distribution over the regression functions; that way you can determine which regression function is the "most probable" one.
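As one concrete (and by no means the only) way to play with this idea, scikit-learn's BayesianRidge places a Gaussian prior on the weights and returns a predictive mean and standard deviation rather than a single point estimate; the sketch below just reuses the polynomial-features setup from the previous example:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import PolynomialFeatures

# Same hypothetical data: linear function plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20).reshape(-1, 1)
y = 2.0 * x.ravel() + 0.5 + rng.normal(scale=0.2, size=x.shape[0])

features = PolynomialFeatures(degree=9)
X = features.fit_transform(x)

# Bayesian treatment: a prior over the weights yields a posterior
# distribution over regression functions instead of a single fit.
model = BayesianRidge()
model.fit(X, y)

# The predictive mean says which value is most probable at a new point,
# and the standard deviation says how uncertain the model is about it.
mean, std = model.predict(features.transform([[0.5]]), return_std=True)
print(mean, std)
```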
(Actually, Tom Mitchell's Machine Learning book contains a pretty good and detailed explanation.)