How to handle a skewed response in H2O algorithms

In my problem, the dataset's response variable is extremely left-skewed. I tried to fit the model using h2o.randomForest() and h2o.gbm() as shown below. I can tune min_split_improvement and min_rows to avoid overfitting in these two cases, but with these models I still see very high errors on the tail observations. I tried using weights_column to up-weight the tail observations and down-weight the rest, but this does not help.

    h2o.model <- h2o.gbm(x = predictors, y = response,
                         training_frame = train, validation_frame = valid,
                         seed = 1, ntrees = 150, max_depth = 10, min_rows = 2,
                         model_id = "GBM_DD", balance_classes = TRUE, nbins = 20,
                         stopping_metric = "MSE", stopping_rounds = 10,
                         min_split_improvement = 0.0005)

    h2o.model <- h2o.randomForest(x = predictors, y = response,
                                  training_frame = train, validation_frame = valid,
                                  seed = 1, ntrees = 150, max_depth = 10, min_rows = 2,
                                  model_id = "DRF_DD", balance_classes = TRUE, nbins = 20,
                                  stopping_metric = "MSE", stopping_rounds = 10,
                                  min_split_improvement = 0.0005)
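For reference, the weights_column approach mentioned above looked roughly like this; the cutoff and weight values are purely illustrative, not my actual settings:

    # Up-weight tail observations (the cutoff of 1000 and weight of 5 are illustrative)
    train[,"obs_weight"] <- h2o.ifelse(train[,response] > 1000, 5, 1)

    h2o.model <- h2o.gbm(x = predictors, y = response,
                         training_frame = train, validation_frame = valid,
                         weights_column = "obs_weight", seed = 1)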

I also tried the h2o.automl() function of the h2o package in hopes of better performance. However, I see significant overfitting there too, and I don't know of any parameters in h2o.automl() for controlling overfitting.
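For context, the h2o.automl() call itself is minimal, roughly as follows; the time budget here is an illustrative value:

    aml <- h2o.automl(x = predictors, y = response,
                      training_frame = train, validation_frame = valid,
                      max_runtime_secs = 3600,  # illustrative budget
                      seed = 1)
    h2o.model <- aml@leader  # best model found by AutoML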

Does anyone know how to avoid overfitting with h2o.automl()?

EDIT

Following Erin's suggestion, the distribution of the log-transformed response is shown below.

[image: distribution of the log-transformed response]

EDIT2: Distribution of the original response.

[image: distribution of the original response]

2 answers

H2O AutoML uses H2O algorithms (like RF and GBM) under the hood, so if you cannot get good models from those directly, you will run into the same problems with AutoML. Also, I'm not sure I would call this overfitting, especially since your models are failing to predict the outliers.

My recommendation is to log-transform your response variable; this is a useful thing to do when you have a skewed response. In the future, H2O AutoML will try to detect a skewed response automatically and take the log for you, but that is not a feature of the current version (H2O 3.16.*).

Here is a bit more detail in case you're not familiar with this process. First create a new column, e.g. log_response, as follows, and use it as the response when training (in RF, GBM, or AutoML):

 train[,"log_response"] <- h2o.log(train[,response]) 

Caveats: if you have zeros in the response, you should use h2o.log1p() instead. Also, do not include the original response among your predictors. In your case, you don't need to change anything, because you already specify the predictors explicitly via the predictors vector.
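For example, a zero-safe version of the transform could look like this; the setdiff() line is only needed if you build the predictor list from all column names (in your case you already have an explicit predictors vector):

    # log1p(x) = log(1 + x), safe when the response contains zeros
    train[,"log_response"] <- h2o.log1p(train[,response])

    # Only needed if predictors were derived from all columns:
    # exclude both the raw and the log response
    predictors <- setdiff(colnames(train), c(response, "log_response"))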

Keep in mind that when you log-transform the response, your predictions and model metrics will be on the log scale. So if you need to convert your predictions back to the original scale, do something like this:

    model <- h2o.randomForest(x = predictors, y = "log_response",
                              training_frame = train, validation_frame = valid)
    log_pred <- h2o.predict(model, test)
    pred <- h2o.exp(log_pred)

This gives you the predictions, but if you also want to see metrics on the original scale, you will have to compute them from the new predictions using the h2o.make_metrics() function, rather than retrieving the metrics from the model.

    perf <- h2o.make_metrics(predicted = pred, actuals = test[,response])
    h2o.mse(perf)

You can try this using RF, as I showed above, or GBM, or with AutoML (which should provide better performance than RF or GBM alone).
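For example, the AutoML version of the same pattern would look roughly like this (the runtime budget is just an illustrative value):

    aml <- h2o.automl(x = predictors, y = "log_response",
                      training_frame = train,
                      max_runtime_secs = 3600,  # illustrative budget
                      seed = 1)
    log_pred <- h2o.predict(aml@leader, test)
    pred <- h2o.exp(log_pred)  # back to the original scale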

Hope this helps improve the performance of your models!


When your target variable is skewed, MSE is not a good metric to use. I would try changing the loss function, because GBM fits the model to the gradient of the loss function, and you want to make sure you are using the correct distribution. If you have a spike at zero and a right-skewed positive target, then Tweedie would probably be a better option.
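For example, a GBM with a Tweedie loss could look like this in H2O's R API; the tweedie_power value is just a starting point to tune (values between 1 and 2 correspond to a compound Poisson-gamma distribution, which can handle a point mass at zero):

    h2o.model <- h2o.gbm(x = predictors, y = response,
                         training_frame = train, validation_frame = valid,
                         distribution = "tweedie",
                         tweedie_power = 1.5,  # tune within (1, 2); illustrative
                         seed = 1)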


Source: https://habr.com/ru/post/1274798/
