This is a little trickier; it depends on the algorithm you are using.
Let's take a simple but somewhat contrived example. Instead of optimizing the parameters of the function
y = a*x1 + b*x2
you could just as well optimize the parameters of
y = 1/a * x1 + 1/b * x2
Obviously, small parameters in the first parameterization correspond to large parameters in the second, so if you minimize them in the first case you would have to maximize them in the latter.
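As a quick illustration (a minimal Python sketch with made-up numbers, not part of the original argument), you can check that the two parameterizations describe exactly the same model, so "small parameters" only means something relative to how you write the model down:

```python
import numpy as np

# Hypothetical toy data: the same model written in two parameterizations.
# Small coefficients in the first form correspond to large coefficients
# in the second, so "keep the parameters small" is not parameterization-free.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

a, b = 0.1, 0.2                      # small parameters in  y = a*x1 + b*x2
y1 = a * X[:, 0] + b * X[:, 1]

a_inv, b_inv = 1 / a, 1 / b          # same model as  y = 1/a * x1 + 1/b * x2
y2 = (1 / a_inv) * X[:, 0] + (1 / b_inv) * X[:, 1]

print(np.allclose(y1, y2))           # True: identical predictions
print((a, b), "vs", (a_inv, b_inv))  # (0.1, 0.2) vs (10.0, 5.0)
```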
The fact that most algorithms minimize the square of the parameters comes from computational learning theory.
Suppose for what follows that you want to learn a function
f(x) = a + b*x + c*x^2 + d*x^3 + ...
One can argue that a function in which only a differs from zero is more probable than a function in which both a and b differ from zero, and so on. Following Occam's razor (if you have two hypotheses that explain your data, the simpler one is more likely the correct one), you should prefer a hypothesis in which more of your parameters are zero.
To give an example, suppose your data points are (x, y) = {(-1, 0), (1, 0)}. Which function would you prefer:
f(x) = 0
or
f(x) = -1 + 1*x^2
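To make the comparison concrete, here is a tiny (hypothetical) check that both candidates pass exactly through the two data points, even though the first one has no nonzero parameters at all:

```python
# Both candidates fit the two data points (-1, 0) and (1, 0) exactly,
# but f(x) = 0 uses no nonzero parameters while f(x) = -1 + x^2 uses two.
xs = [-1, 1]

f_simple = lambda x: 0.0
f_quad = lambda x: -1.0 + x ** 2

print([f_simple(x) for x in xs])   # [0.0, 0.0]
print([f_quad(x) for x in xs])     # [0.0, 0.0]
```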
Extending this a bit, you can go from preferring parameters that are exactly zero to preferring parameters with small values.
If you want to try this out, generate some data points from a linear function and add Gaussian noise. If you then look for the polynomial that fits these points perfectly, you will need a rather complex function, typically with large weights. However, if you apply regularization, you will come closer to your data-generating function.
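One possible way to run this experiment (a sketch assuming numpy and scikit-learn; the degree, noise level, and ridge strength are arbitrary choices on my part, not anything prescribed above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Data generated from a linear function plus Gaussian noise (hypothetical setup).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20).reshape(-1, 1)
y = 2.0 * x.ravel() + 0.5 + rng.normal(scale=0.2, size=x.shape[0])

degree = 9  # deliberately far too flexible for linear data

# Unregularized polynomial fit: tends to chase the noise with large weights.
plain = make_pipeline(PolynomialFeatures(degree), LinearRegression())
plain.fit(x, y)

# Ridge-regularized fit: penalizing squared weights keeps them small,
# and the fitted curve stays closer to the underlying linear function.
regularized = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0))
regularized.fit(x, y)

print("max |weight| without regularization:",
      np.abs(plain.named_steps["linearregression"].coef_).max())
print("max |weight| with ridge:",
      np.abs(regularized.named_steps["ridge"].coef_).max())
```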
But if you want to ground your reasoning in solid theoretical foundations, I would recommend looking at Bayesian statistics. The idea there is to determine a probability distribution over the regression functions; that way you can determine which regression function is the "most probable" one.
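As one concrete (and by no means the only) way to play with this idea, scikit-learn's BayesianRidge places a Gaussian prior on the weights and returns a predictive mean and standard deviation rather than a single point estimate; the sketch below just reuses the polynomial-features setup from the previous example:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import PolynomialFeatures

# Same hypothetical data: linear function plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20).reshape(-1, 1)
y = 2.0 * x.ravel() + 0.5 + rng.normal(scale=0.2, size=x.shape[0])

features = PolynomialFeatures(degree=9)
X = features.fit_transform(x)

# Bayesian treatment: a prior over the weights yields a posterior
# distribution over regression functions instead of a single fit.
model = BayesianRidge()
model.fit(X, y)

# The predictive mean says which value is most probable at a new point,
# and the standard deviation says how uncertain the model is about it.
mean, std = model.predict(features.transform([[0.5]]), return_std=True)
print(mean, std)
```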
(Actually, Tom Mitchell's Machine Learning book contains a pretty good and detailed explanation.)