I am trying to use XGBoost to model the claims frequency of data generated from exposure periods of unequal length, but I have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset. Is this possible in XGBoost?
(A similar question was posted here: xgboost, exposure bias?)
To illustrate the problem, the R code below generates some data with fields:
- x1, x2 - factors (0 or 1)
- exposure - the length of the policy period for the observed data
- frequency - the mean number of claims per unit of exposure
- claims - the number of claims observed ~ Poisson(frequency * exposure)
The goal is to predict frequency using x1 and x2. The true model is: frequency = 2 if x1 = x2 = 1, and frequency = 1 otherwise.
Exposure cannot be used as a predictor of frequency, since it is not known at the start of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.
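To spell out what I mean by an offset: with a log link, the relationship above can be written so that log(exposure) enters with a fixed coefficient of 1. Here $f(x_1, x_2)$ is just my notation for the log-frequency the model should learn; it is not a name from any library.

```latex
\begin{align*}
\mathbb{E}[\text{claims} \mid x_1, x_2, \text{exposure}]
  &= \text{frequency}(x_1, x_2) \times \text{exposure} \\
\log \mathbb{E}[\text{claims} \mid x_1, x_2, \text{exposure}]
  &= \underbrace{\log \text{frequency}(x_1, x_2)}_{f(x_1, x_2)}
   + \underbrace{\log(\text{exposure})}_{\text{offset}}
\end{align*}
```

For example, under the true model a policy with x1 = x2 = 1 and exposure = 1.5 has an expected claim count of 2 * 1.5 = 3.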
The code tries to predict this with XGBoost by:
- Setting exposure as a weight in the xgb.DMatrix
- Setting log(exposure) as an offset
Below these, I have shown how I would handle the situation for a tree (rpart) or a GBM (gbm).
    set.seed(1)
    size <- 10000

    # Simulated policies: two binary factors and an exposure between 0.3 and 3
    d <- data.frame(
      x1 = sample(c(0, 1), size, replace = TRUE, prob = c(0.5, 0.5)),
      x2 = sample(c(0, 1), size, replace = TRUE, prob = c(0.5, 0.5)),
      exposure = runif(size, 1, 10) * 0.3
    )
    # True frequency is 2 when x1 = x2 = 1, otherwise 1
    d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
    d$claims <- rpois(size, lambda = d$frequency * d$exposure)

    #### Try to fit using XGBoost
    require(xgboost)
    param0 <- list(
      "objective" = "count:poisson",
      "eval_metric" = "logloss",
      "eta" = 1,
      "subsample" = 1,
      "colsample_bytree" = 1,
      "min_child_weight" = 1,
      "max_depth" = 2
    )

    ## 1 - set weight in xgb.DMatrix
    xgtrain <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]),
                           label = d$claims, weight = d$exposure)
    xgb <- xgb.train(nrounds = 1, params = param0, data = xgtrain)
    d$XGB_P_1 <- predict(xgb, xgtrain)

    ## 2 - set as offset in xgb.DMatrix
    xgtrain.mf <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"), d)
    xgtrain.m <- model.matrix(attr(xgtrain.mf, "terms"), data = d)
    xgtrain <- xgb.DMatrix(xgtrain.m, label = d$claims)
    xgb <- xgb.train(nrounds = 1, params = param0, data = xgtrain)
    d$XGB_P_2 <- predict(xgb, xgtrain)

    #### Fit a tree
    require(rpart)
    # rpart's poisson method takes a two-column response: (exposure, event count)
    d[, "tree_response"] <- cbind(d$exposure, d$claims)
    tree <- rpart(tree_response ~ x1 + x2, data = d, method = "poisson")
    d$Tree_F <- predict(tree, newdata = d)

    #### Fit a GBM
    require(gbm)
    gbm <- gbm(claims ~ x1 + x2 + offset(log(exposure)),
               data = d, distribution = "poisson",
               n.trees = 1, shrinkage = 1, interaction.depth = 2,
               bag.fraction = 0.5)
    d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type = "response")
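One direction I have not been able to confirm is the `base_margin` field of `xgb.DMatrix`. The sketch below assumes that, for `count:poisson`, `base_margin` is added to the tree output on the log scale and therefore behaves like an offset. That is my assumption, not something I have verified against the documentation. The names `dtrain`, `fit`, `pred_claims` and `XGB_F`, and the choice of `eta = 0.3` with `nrounds = 50`, are illustrative only.

```r
library(xgboost)

# Same simulated data as in the question
set.seed(1)
n <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), n, replace = TRUE),
  x2 = sample(c(0, 1), n, replace = TRUE),
  exposure = runif(n, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(n, lambda = d$frequency * d$exposure)

# Features exclude exposure; log(exposure) is supplied as base_margin
# (assumed here to act as an offset on the log scale)
dtrain <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]), label = d$claims)
setinfo(dtrain, "base_margin", log(d$exposure))

fit <- xgb.train(
  params = list(objective = "count:poisson", eta = 0.3, max_depth = 2),
  data = dtrain,
  nrounds = 50
)

# Predictions on a DMatrix carrying base_margin are expected claim counts;
# dividing by exposure gives the predicted frequency
pred_claims <- predict(fit, dtrain)
d$XGB_F <- pred_claims / d$exposure

# Mean predicted frequency per (x1, x2) cell, to compare with the true model
print(aggregate(XGB_F ~ x1 + x2, data = d, FUN = mean))
```

If the assumption holds, the predicted frequency should be close to 2 for x1 = x2 = 1 and close to 1 elsewhere. I used several rounds with a smaller `eta` instead of a single round with `eta = 1`, since I am not sure a single boosting step is enough for this objective.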