XGBoost - Poisson distribution with varying exposure / offset

I am trying to use XGBoost to model the frequency of claims on data generated from policy periods of unequal length, but could not get the model to handle exposure correctly. I would normally do this by setting log(exposure) as an offset - can you do this in XGBoost?

(A similar question was posted here: xgboost, offset exposure?)

To illustrate the problem, the R code below generates some data with fields:

  • x1, x2 - factors (0 or 1)
  • exposure - the length of the policy period for the observed data
  • frequency - the average number of claims per exposure unit.
  • claims - the number of claims observed ~ Poisson (frequency * exposure)

The goal is to predict frequency using x1 and x2, the true model being: frequency = 2 if x1 = x2 = 1, frequency = 1 otherwise.

Exposure cannot be used to predict frequency, since it is not known at policy inception. The only way we can use it is to say: expected number of claims = frequency * exposure.

The code tries to predict this with XGBoost:

  • Setting exposure as the weight in the xgb.DMatrix
  • Setting log(exposure) as an offset

Below I also show how I would handle the situation with a tree (rpart) or a gbm.

set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0,1), size, replace = T, prob = c(0.5,0.5)),
  x2 = sample(c(0,1), size, replace = T, prob = c(0.5,0.5)),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

#### Try to fit using XGBoost
require(xgboost)

param0 <- list(
  "objective" = "count:poisson"
  , "eval_metric" = "logloss"
  , "eta" = 1
  , "subsample" = 1
  , "colsample_bytree" = 1
  , "min_child_weight" = 1
  , "max_depth" = 2
)

## 1 - set exposure as weight in the xgb.DMatrix
xgtrain <- xgb.DMatrix(as.matrix(d[, c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb <- xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)
d$XGB_P_1 <- predict(xgb, xgtrain)

## 2 - set log(exposure) as an offset via the model matrix
xgtrain.mf <- model.frame(as.formula("claims ~ x1 + x2 + offset(log(exposure))"), d)
xgtrain.m <- model.matrix(attr(xgtrain.mf, "terms"), data = d)
xgtrain <- xgb.DMatrix(xgtrain.m, label = d$claims)
xgb <- xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)
d$XGB_P_2 <- predict(xgb, xgtrain)

#### Fit a tree
require(rpart)
d[, "tree_response"] <- cbind(d$exposure, d$claims)
tree <- rpart(tree_response ~ x1 + x2, data = d, method = "poisson")
d$Tree_F <- predict(tree, newdata = d)

#### Fit a GBM
require(gbm)
gbm <- gbm(claims ~ x1 + x2 + offset(log(exposure)), data = d, distribution = "poisson",
           n.trees = 1, shrinkage = 1, interaction.depth = 2, bag.fraction = 0.5)
d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type = "response")
2 answers

At a minimum, with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family=poisson(link='log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family=poisson(link='log') and weight=exposure. That is, normalize your count by exposure to get a frequency, and model the frequency with exposure as the weight. Your estimated coefficients should be the same in both cases when using glm for Poisson regression. Try it yourself on some sample data.
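A minimal sketch of that check, assuming the simulated data frame d from the question is in scope (glm warns about non-integer responses in the weighted formulation, but the fitted coefficients should still agree):

 # Offset formulation: claim counts with log(exposure) as an offset
 glm_offset <- glm(claims ~ x1 + x2 + offset(log(exposure)),
                   family = poisson(link = "log"), data = d)

 # Weight formulation: frequency (claims / exposure) with exposure as the weight
 glm_weight <- glm(I(claims / exposure) ~ x1 + x2,
                   family = poisson(link = "log"), weights = exposure, data = d)

 # The estimated coefficients match (up to numerical tolerance)
 cbind(offset = coef(glm_offset), weight = coef(glm_weight))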

I'm not quite sure what objective='count:poisson' corresponds to internally, but I would expect that setting your target variable to the frequency (count / exposure) and using exposure as the weight in xgboost would be the way to go when exposure varies.
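A sketch of that suggestion, reusing d and param0 from the question (the new column name XGB_P_weight is just for illustration; whether count:poisson is strictly appropriate for a non-integer frequency label is exactly the uncertainty noted above):

 require(xgboost)

 # Frequency as the label, exposure as the weight
 xgtrain_freq <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]),
                             label = d$claims / d$exposure,
                             weight = d$exposure)
 xgb_freq <- xgb.train(nrounds = 1, params = param0, data = xgtrain_freq)

 # Predictions are then directly on the frequency scale
 d$XGB_P_weight <- predict(xgb_freq, xgtrain_freq)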


I have now worked out how to do this, using setinfo to set the base_margin attribute to the offset (on the linear predictor scale), i.e.:

 setinfo(xgtrain, "base_margin", log(d$exposure)) 
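For context, a sketch of how this slots into the question's code, assuming the d and param0 objects from above. Since base_margin enters on the log (linear predictor) scale, predictions from a DMatrix carrying the margin come back as expected claim counts, so dividing by exposure recovers the frequency:

 xgtrain <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]), label = d$claims)
 setinfo(xgtrain, "base_margin", log(d$exposure))

 xgb_offset <- xgb.train(nrounds = 1, params = param0, data = xgtrain)

 # Predictions include the offset, i.e. expected claim counts;
 # divide by exposure to get back to frequency
 d$XGB_F <- predict(xgb_offset, xgtrain) / d$exposure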
