Prediction of xgboost in R differs on sparse and dense matrices

I trained a simple model with the xgboost library in R on a matrix created by sparse.model.matrix, then made predictions for two versions of the validation data: one created with sparse.model.matrix from Matrix, and the other with model.matrix from stats. To my great surprise, the results differ noticeably. The sparse and dense matrices have the same dimensions, all data are numeric and there are no missing values.

The average prediction for these two sets is as follows:

  • dense validation matrix: 0.5009256
  • sparse validation matrix: 0.4988821

Is this a feature or a bug?

Update:

I noticed that the discrepancy does not occur when all values are positive or all values are negative. If the variable x1 is defined as x1=sample(1:7, 2000, replace=T), the average prediction is the same in both cases.

Code in R:

library(Matrix)
library(xgboost)

# Simulated data: x1 is drawn from -1:5, so it contains zeros
valid <- data.frame(y=sample(0:1, 2000, replace=TRUE), x1=sample(-1:5, 2000, replace=TRUE), x2=runif(2000))
train <- data.frame(y=sample(0:1, 10000, replace=TRUE), x1=sample(-1:5, 10000, replace=TRUE), x2=runif(10000))

# Sparse training matrix
sparse_train_matrix <- sparse.model.matrix(~ ., data=train[, c("x1", "x2")])
d_sparse_train_matrix <- xgb.DMatrix(sparse_train_matrix, label = train$y)

# Sparse validation matrix
sparse_valid_matrix <- sparse.model.matrix(~ ., data=valid[, c("x1", "x2")])
d_sparse_valid_matrix <- xgb.DMatrix(sparse_valid_matrix, label = valid$y)

# Dense validation matrix built from the same data
valid_matrix <- model.matrix(~ ., data=valid[, c("x1", "x2")])
d_valid_matrix <- xgb.DMatrix(valid_matrix, label = valid$y)

params <- list(objective = "binary:logistic", seed = 99, eval_metric = "auc")

sparse_w <- list(train=d_sparse_train_matrix, test=d_sparse_valid_matrix)
set.seed(1)
sparse_fit_xgb <- xgb.train(data=d_sparse_train_matrix, watchlist=sparse_w, params=params, nrounds=100)

# Predict with the same model on the dense (p1) and sparse (p2) validation data
p1 <- predict(sparse_fit_xgb, newdata=d_valid_matrix, type="response")
p2 <- predict(sparse_fit_xgb, newdata=d_sparse_valid_matrix, type="response")

mean(p1); mean(p2)

My session:

R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
[5] LC_TIME=Polish_Poland.1250

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] xgboost_0.6-4 Matrix_1.2-10 data.table_1.10.4 dplyr_0.7.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 lattice_0.20-35 assertthat_0.2.0 grid_3.4.1
[5] R6_2.2.2 magrittr_1.5 stringi_1.1.5 rlang_0.1.1
[9] bindrcpp_0.2 tools_3.4.1 glue_1.1.1 compiler_3.4.1
[13] pkgconfig_2.0.1 bindr_0.1 tibble_1.3.3
1 answer

I found here and here that this is the expected behavior, and it makes sense to me.
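To make the point concrete: as far as I understand it, xgboost treats entries that are not stored in a sparse matrix as missing, while a zero in a dense matrix is an ordinary value. That would also explain the update in the question, since x1 drawn from 1:7 contains no zeros. The sketch below is only an illustration under that assumption. It is not taken from the linked posts, the variable names are made up, and it assumes that the missing argument of xgb.DMatrix is available in the installed xgboost version.

```r
library(Matrix)
library(xgboost)

set.seed(1)
n <- 2000
# x1 is drawn from -1:5, so it contains zeros (as in the question)
df <- data.frame(x1 = sample(-1:5, n, replace = TRUE), x2 = runif(n))
y <- sample(0:1, n, replace = TRUE)

dense <- model.matrix(~ ., data = df)
sparse <- sparse.model.matrix(~ ., data = df)

# Number of zeros in x1; these are the entries the two formats may treat differently
sum(df$x1 == 0)

# Train on the sparse matrix, as in the question
fit <- xgb.train(params = list(objective = "binary:logistic"),
                 data = xgb.DMatrix(sparse, label = y),
                 nrounds = 20)

p_sparse <- predict(fit, xgb.DMatrix(sparse))
# Dense input with the default missing = NA: zeros are ordinary values
p_dense <- predict(fit, xgb.DMatrix(dense))
# Assumption: missing = 0 makes dense zeros count as missing, like the sparse input
p_dense_missing0 <- predict(fit, xgb.DMatrix(dense, missing = 0))

# Largest absolute difference from the sparse predictions in each case
max(abs(p_sparse - p_dense))
max(abs(p_sparse - p_dense_missing0))
```

If the assumption holds, the second difference should be much smaller than the first. The simplest way to avoid the problem is to use the same matrix type for training and prediction.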


Source: https://habr.com/ru/post/1687137/

