How does using smaller parameter values help prevent overfitting?

To reduce the problem of overfitting in linear regression in machine learning, it is suggested to modify the cost function to include the squares of the parameters (regularization). This results in smaller parameter values.
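
For concreteness, the cost function is changed to something like the standard L2-regularized (ridge) form (the exact scaling constants vary between courses and textbooks):

 J(theta) = (1/(2m)) * sum_i (h_theta(x_i) - y_i)^2 + (lambda/(2m)) * sum_j theta_j^2

where the second term penalizes large values of the parameters theta_j.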

This is not at all intuitive to me. How do smaller parameter values lead to a simpler hypothesis and help prevent overfitting?

+5
3 answers

This is a little tricky. It depends on the algorithm you are using.

Take a simple but somewhat contrived example. Instead of optimizing the parameters of the function

y = a*x1 + b*x2 

you could instead optimize the parameters of

  y = 1/a * x1 + 1/b * x2 

Obviously, if you minimize the parameters in the first case, you need to maximize them in the second. So whether "small parameters" means a simple hypothesis depends on the parameterization.
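
A minimal numeric illustration of that point (my own addition, not from the original answer): the same line can be written with small or large parameter values depending on the parameterization.

 # The line y = 0.1*x1 + 0.2*x2 in two parameterizations:
 # 1) y = a*x1 + b*x2         with a = 0.1, b = 0.2   -> "small" parameters
 # 2) y = (1/a)*x1 + (1/b)*x2 with a = 10,  b = 5     -> "large" parameters
 x1, x2 = 3.0, 4.0
 y1 = 0.1 * x1 + 0.2 * x2                 # parameterization 1
 y2 = (1 / 10.0) * x1 + (1 / 5.0) * x2    # parameterization 2, same value
 assert abs(y1 - y2) < 1e-12

So "keep the parameters small" is only meaningful relative to the particular way the model is parameterized.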

The reason that, for most algorithms, minimizing the squares of the parameters works comes from computational learning theory.

Suppose, for what follows, that you want to learn a function

 f(x) = a + b*x + c*x^2 + d*x^3 + ...

One can argue that a function where only a is nonzero is simpler than a function where a and b are nonzero, and so on. Following Occam's razor (if you have two hypotheses that explain your data, the simpler one is more likely to be correct), you should prefer the hypothesis in which more of your parameters are zero.

To give an example, suppose your data points are (x, y) = {(-1, 0), (1, 0)}. Which function do you prefer:

 f(x) = 0 

or

 f(x) = -1 + 1*x^2 
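
A quick sanity check (my own snippet, not part of the answer) confirms that both candidates pass through the two data points, while the first one needs no nonzero parameters at all:

 import numpy as np

 xs = np.array([-1.0, 1.0])
 ys = np.array([0.0, 0.0])
 f1 = lambda x: np.zeros_like(x)   # f(x) = 0
 f2 = lambda x: -1 + 1 * x**2      # f(x) = -1 + x^2
 print(np.allclose(f1(xs), ys), np.allclose(f2(xs), ys))  # True True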

Extending this a bit, you can go from preferring parameters that are exactly zero to preferring parameters with small values.

If you want to try it, sample some data points from a linear function and add some Gaussian noise. If you then look for a polynomial that fits the data perfectly, you will need a rather complex function, typically with large weights. However, if you apply regularization, you will end up closer to your data-generating function.
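
A minimal sketch of that experiment (my own code; the degree, noise level, and lambda are arbitrary choices):

 import numpy as np

 rng = np.random.default_rng(0)
 x = np.linspace(-3, 3, 10)
 y = 2 * x + 1 + rng.normal(scale=1.0, size=x.shape)   # linear data plus Gaussian noise

 # Unregularized degree-9 fit: interpolates the points, typically with huge weights.
 coef_plain = np.polyfit(x, y, deg=9)

 # Ridge-style regularized fit on the same polynomial features.
 X = np.vander(x, 10)                                   # columns x^9 ... x^0
 lam = 1.0
 coef_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

 print(np.abs(coef_plain).max(), np.abs(coef_ridge).max())  # regularized weights come out far smaller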

But if you want to ground this reasoning in fundamental theory, I would recommend looking at Bayesian statistics. The idea is that you define a probability distribution over regression functions. That way, you can say what a "probable" regression function is.
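
For example (my own summary of the standard result), ridge regression is MAP estimation with a Gaussian prior on the weights: if y ~ N(w.x, sigma^2) and the prior is w ~ N(0, tau^2), then minimizing the negative log posterior amounts to minimizing

 sum_i (y_i - w.x_i)^2 + (sigma^2 / tau^2) * ||w||^2

i.e. the squared-parameter penalty with lambda = sigma^2 / tau^2. A stronger prior belief that the weights are small (smaller tau) corresponds to a larger lambda.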

(Tom Mitchell's Machine Learning book contains a pretty good and detailed explanation.)

+2

I put together a pretty far-fetched example, but hopefully this helps.

 import pandas as pd
 import numpy as np
 from sklearn import datasets
 from sklearn.linear_model import Ridge, Lasso
 from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
 from sklearn.preprocessing import PolynomialFeatures

First, create a linear dataset with a train/test split, 5 points in each.

 X, y, c = datasets.make_regression(10, 1, noise=5, coef=True, shuffle=True, random_state=0)
 X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=5)

(Plot: the initial data)

Fit the data using a fifth-order polynomial without regularization.

 from sklearn.pipeline import Pipeline
 from sklearn.preprocessing import StandardScaler
 pipeline = Pipeline([
     ('poly', PolynomialFeatures(5)),
     ('model', Ridge(alpha=0.))  # alpha=0 means no regularization.
 ])
 pipeline.fit(X_train, y_train)

Looking at the coefficients:

 pipeline.named_steps['model'].coef_
 pipeline.named_steps['model'].intercept_
 # y_pred = -12.82 + 33.59 x + 292.32 x^2 - 193.29 x^3 - 119.64 x^4 + 78.87 x^5

(Plot: fit with no regularization)

Here the model passes through every training point, but it has large coefficients and does not generalize to the test points.
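
A quick way to see this numerically (my addition, reusing the same pipeline object) is to compare the R^2 score on the training and test splits:

 # A near-perfect training score combined with a poor (possibly negative)
 # test score is the numeric signature of overfitting.
 print(pipeline.score(X_train, y_train))
 print(pipeline.score(X_test, y_test))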

Try again, but this time add L2 regularization:

 pipeline.set_params(model__alpha=1)
 pipeline.fit(X_train, y_train)  # refit so the new alpha takes effect

(Plot: fit with regularization)

 y_pred = 6.88 + 26.13 x + 16.58 x^2 + 12.47 x^3 + 5.86 x^4 - 5.20 x^5 

Here we see a much smoother shape, with less wiggling. The model no longer passes through every training point, but the curve is much smoother. The coefficients are smaller because of the added regularization.
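
As a side note (my own sketch, not part of the original answer), the imports above also include Lasso. Swapping it in shows the L1 counterpart: instead of just shrinking the coefficients, it tends to push some of them to exactly zero, which connects back to the "prefer more zero parameters" argument in the first answer.

 # Replace the Ridge step with Lasso (L1 penalty); alpha=1 is an arbitrary choice.
 pipeline.set_params(model=Lasso(alpha=1, max_iter=100000))
 pipeline.fit(X_train, y_train)
 print(pipeline.named_steps['model'].coef_)  # with a large enough alpha, some coefficients are exactly 0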

+3

Adding squared terms to your function (going from linear to polynomial) means you can fit a curve instead of a straight line.

An example of a polynomial function:

 y = q + t1*x1 + t2*x2^2

Doing this, however, can lead to a result that follows the training data too closely, so that new data is predicted poorly. This gets worse as you add more polynomial terms (3rd, 4th order). So when adding polynomial terms, you always need to make sure the data is not being overfitted.

To get a better feel for this, draw some curves in a spreadsheet and watch how the curves change to follow your data.
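
If you prefer Python to a spreadsheet, here is a minimal sketch (my own) of the same exercise using numpy.polyfit:

 import numpy as np

 rng = np.random.default_rng(1)
 x = np.linspace(0, 1, 8)
 y = 3 * x + 0.5 + rng.normal(scale=0.2, size=x.shape)  # noisy linear data

 for deg in (1, 3, 7):
     coefs = np.polyfit(x, y, deg)
     print(deg, np.round(coefs, 2))  # higher degrees chase the noise, typically with larger coefficients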

0

Source: https://habr.com/ru/post/1239690/

