How does scikit-learn perform the line search in gradient boosting?

In this section of the gradient boosting documentation, it says:

Gradient Boosting attempts to solve this minimization problem numerically via steepest descent: the steepest descent direction is the negative gradient of the loss function evaluated at the current model F_{m-1}, which can be calculated for any differentiable loss function:

F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_F L(y_i, F_{m-1}(x_i))

Where the step length \gamma_m is chosen using line search:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}\right)

I understand the purpose of the line search, but I don't understand the algorithm itself. I have read the source code, but it still hasn't clicked. An explanation would be very helpful.

3 answers

The implementation depends on which loss function you choose when you initialize the GradientBoostingClassifier instance (I use this one as an example; the regression part should be analogous). The default loss function is 'deviance', and the corresponding optimization algorithm is implemented there. In the _update_terminal_region function, a single Newton-Raphson step is performed rather than a full line search.
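A minimal sketch of what that single Newton-Raphson step per leaf could look like for the binomial deviance loss. The helper name and arguments below are mine, not scikit-learn's, but the numerator/denominator mirror the formula used in the deviance loss's _update_terminal_region:

```python
import numpy as np

def newton_step_for_leaf(y, raw_prediction):
    """One Newton-Raphson step for the binomial deviance loss.

    Hypothetical stand-alone helper: ``y`` holds the 0/1 labels and
    ``raw_prediction`` the current additive scores for the samples that
    fall into a single leaf; the return value is the new leaf value.
    """
    # pseudo-residuals = negative gradient of the deviance loss
    proba = 1.0 / (1.0 + np.exp(-raw_prediction))
    residual = y - proba

    # numerator = sum of gradients, denominator = sum of Hessians,
    # expressed through the residuals; note that
    # (y - residual) * (1 - y + residual) == proba * (1 - proba)
    numerator = residual.sum()
    denominator = np.sum((y - residual) * (1 - y + residual))

    if abs(denominator) < 1e-150:   # guard against a degenerate leaf
        return 0.0
    return numerator / denominator
```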

Hope this answers your question.


What I suspect is tripping you up is this: you can see where scikit-learn computes the negative gradient of the loss function and fits a base estimator to that negative gradient. It looks like the _update_terminal_region method is responsible for determining the step size, but nowhere can you see it solving the line-search minimization problem as written in the documentation.

The reason you cannot find the line search is that, for the special case of regression decision trees, which are just piecewise constant functions, the optimal solution is usually known in advance. For example, if you look at the _update_terminal_region method of the LeastAbsoluteError loss, you will see that the leaves of the tree are assigned the weighted median of the difference between y and the predicted value, over the examples that fall into that leaf. This median is the known optimal solution.
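As an illustration, here is a sketch of such a closed-form leaf update. The helper and its argument names are hypothetical; scikit-learn uses its own weighted-percentile utility, but the idea is the same weighted median of the residuals:

```python
import numpy as np

def lad_leaf_value(y, raw_prediction, sample_weight):
    """Closed-form leaf value for the least-absolute-error loss.

    For the samples that fall into one leaf, the constant c minimizing
    sum_i w_i * |y_i - (raw_prediction_i + c)| is the weighted median
    of the residuals y - raw_prediction.
    """
    diff = y - raw_prediction
    order = np.argsort(diff)
    cum_weight = np.cumsum(sample_weight[order])
    # smallest index at which the cumulative weight reaches half the total
    idx = np.searchsorted(cum_weight, 0.5 * cum_weight[-1])
    return diff[order][idx]
```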

To summarize, each gradient boosting iteration performs the following steps:

  • Compute the negative gradient of the loss function at the current prediction.

  • Fit a DecisionTreeRegressor to that negative gradient. The fitting produces a tree with splits that are good at decreasing the loss.

  • Replace the values in the leaves of the DecisionTreeRegressor with values that minimize the loss. These are usually computed with some simple, known formula that exploits the fact that the decision tree is just a piecewise constant function.

This method should be at least as good as what is described in the documentation, but I think in some cases it may not be identical to it.
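For concreteness, here is a toy sketch of those three steps for the absolute-error loss. This is not scikit-learn's actual code: the function name and structure are simplified, and overwriting tree_.value is used only to illustrate the leaf-replacement idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_iteration(X, y, raw_prediction, learning_rate=0.1, max_depth=3):
    """One illustrative boosting iteration for the absolute-error loss."""
    # 1. negative gradient of |y - F| with respect to F is sign(y - F)
    negative_gradient = np.sign(y - raw_prediction)

    # 2. fit a regression tree to the negative gradient
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, negative_gradient)

    # 3. replace each leaf value with the per-leaf minimizer of the
    #    original loss: here, the median of the residuals in that leaf
    leaf_ids = tree.apply(X)
    for leaf in np.unique(leaf_ids):
        in_leaf = leaf_ids == leaf
        tree.tree_.value[leaf, 0, 0] = np.median(y[in_leaf] - raw_prediction[in_leaf])

    # update the additive model with the shrunken tree predictions
    return raw_prediction + learning_rate * tree.predict(X), tree
```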


From your comments, it seems that it is the algorithm itself that is unclear, not the way scikit-learn implements it.

The notation in the Wikipedia article is a bit sloppy: one never differentiates with respect to a function evaluated at a point. Once you replace F_{m-1}(x_i) with \hat{y_i} and replace the partial derivative with the partial derivative evaluated at \hat{y} = F_{m-1}(x_i), everything becomes clearer:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, \hat{y_i} - \gamma \left[\frac{\partial L(y_i, \hat{y})}{\partial \hat{y}}\right]_{\hat{y} = \hat{y_i}}\right)

This also removes x_i (sort of) from the minimization problem and makes the intent of the line search visible: optimize given the current predictions, not as a function of the training set. Now notice that:

h_m(x_i) \approx -\left[\frac{\partial L(y_i, \hat{y})}{\partial \hat{y}}\right]_{\hat{y} = \hat{y_i}}

Therefore, you can simply collapse this to:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, \hat{y_i} + \gamma\, h_m(x_i)\right)

Thus, the line search simply optimizes the one remaining degree of freedom you have (after finding the gradient direction): the step size.
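If you want to see that one degree of freedom in code, here is a sketch of the line search taken literally. The function name, the loss argument, and the use of scipy's scalar minimizer are my own choices for illustration; a real implementation would typically use a closed-form solution instead.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_gamma(y, y_hat, h, loss):
    """One-dimensional line search over the step size gamma.

    y_hat are the current predictions F_{m-1}(x_i), h the values of the
    new base learner h_m(x_i), and loss(y, pred) a vectorized per-sample
    loss whose sum is minimized over gamma.
    """
    objective = lambda gamma: np.sum(loss(y, y_hat + gamma * h))
    return minimize_scalar(objective).x

# example with squared error, where the optimum also has a closed form
squared_error = lambda y, pred: (y - pred) ** 2
# gamma_m = line_search_gamma(y, y_hat, h, squared_error)
```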


Source: https://habr.com/ru/post/1261520/

