I trained a simple model with the xgboost library in R on a matrix created by sparse.model.matrix, then made predictions for two versions of the validation data: one built with sparse.model.matrix from Matrix, the other with model.matrix from stats. To my great surprise, the results differ. The sparse and dense matrices have the same dimensions, all data are numeric, and there are no missing values.
The average prediction for these two sets is as follows:
- dense validation matrix: 0.5009256
- sparse validation matrix: 0.4988821
 
Is this intended behaviour or a bug?
Update:
I noticed that the discrepancy does not occur when all values are either positive or negative. If x1 is defined as x1=sample(1:7, 2000, replace=T), the average prediction is the same in both cases.
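One thing that changes between the original x1 (drawn from -1:5) and the variant drawn from 1:7 is that the first range contains 0 and the second does not. The xgboost FAQ notes that the tree booster treats entries absent from a sparse matrix as missing, so zeros, rather than sign, may be the relevant factor. The following is a minimal sketch using only Matrix that counts zeros in each variant; the names x1_mixed, x1_pos, m_mixed and m_pos are illustrative, and the counts alone do not prove this is the cause:

```r
library(Matrix)

set.seed(1)
x1_mixed <- sample(-1:5, 2000, replace = TRUE)  # original definition, range contains 0
x1_pos   <- sample(1:7, 2000, replace = TRUE)   # variant from the update, range has no 0

m_mixed <- sparse.model.matrix(~ x1, data = data.frame(x1 = x1_mixed))
m_pos   <- sparse.model.matrix(~ x1, data = data.frame(x1 = x1_pos))

# nnzero() counts the non-zero entries of the x1 column
nnz_mixed <- nnzero(m_mixed[, "x1"])
nnz_pos   <- nnzero(m_pos[, "x1"])

cat("zeros in x1 (-1:5):", sum(x1_mixed == 0), " non-zero entries:", nnz_mixed, "\n")
cat("zeros in x1 (1:7): ", sum(x1_pos == 0), " non-zero entries:", nnz_pos, "\n")
```

If the first line reports a non-trivial number of zeros, those are the rows where a sparse and a dense representation could be read differently.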
Code in R:
library(Matrix)
library(xgboost)

# Simulated data: binary target, integer feature x1 in -1..5, uniform feature x2
valid <- data.frame(y=sample(0:1, 2000, replace=TRUE), x1=sample(-1:5, 2000, replace=TRUE), x2=runif(2000))
train <- data.frame(y=sample(0:1, 10000, replace=TRUE), x1=sample(-1:5, 10000, replace=TRUE), x2=runif(10000))

# Training and validation data as sparse matrices
sparse_train_matrix <- sparse.model.matrix(~ ., data=train[, c("x1", "x2")])
d_sparse_train_matrix <- xgb.DMatrix(sparse_train_matrix, label = train$y)
sparse_valid_matrix <- sparse.model.matrix(~ ., data=valid[, c("x1", "x2")])
d_sparse_valid_matrix <- xgb.DMatrix(sparse_valid_matrix, label = valid$y)

# The same validation data as a dense matrix
valid_matrix <- model.matrix(~ ., data=valid[, c("x1", "x2")])
d_valid_matrix <- xgb.DMatrix(valid_matrix, label = valid$y)

params <- list(objective = "binary:logistic", seed = 99, eval_metric = "auc")
sparse_w <- list(train=d_sparse_train_matrix, test=d_sparse_valid_matrix)
set.seed(1)
sparse_fit_xgb <- xgb.train(data=d_sparse_train_matrix, watchlist=sparse_w, params=params, nrounds=100)

# Same model, same validation rows: dense input (p1) vs. sparse input (p2)
p1 <- predict(sparse_fit_xgb, newdata=d_valid_matrix, type="response")
p2 <- predict(sparse_fit_xgb, newdata=d_sparse_valid_matrix, type="response")
mean(p1); mean(p2)
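Comparing only the two means hides which rows actually differ. As a rough sketch, the hypothetical helper below (diff_by_value is my own name, not part of xgboost) tabulates, for each distinct value of a feature, how many rows have differing predictions. It is shown on made-up toy vectors, not on real model output:

```r
# Hypothetical helper: per feature value, count rows where two prediction
# vectors disagree by more than a tolerance
diff_by_value <- function(p_a, p_b, feature, tol = 1e-8) {
  stopifnot(length(p_a) == length(p_b), length(p_a) == length(feature))
  differs <- abs(p_a - p_b) > tol
  table(value = feature, differs = factor(differs, levels = c(FALSE, TRUE)))
}

# Toy input, invented for illustration only
toy_feature  <- c(-1, 0, 0, 1, 2)
toy_p_dense  <- c(0.40, 0.55, 0.52, 0.61, 0.47)
toy_p_sparse <- c(0.40, 0.50, 0.49, 0.61, 0.47)

res <- diff_by_value(toy_p_dense, toy_p_sparse, toy_feature)
print(res)
```

With the objects from the code above, the call would be diff_by_value(p1, p2, valid$x1), which shows whether the differences are concentrated at particular values of x1.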
My session:
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
[5] LC_TIME=Polish_Poland.1250

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] xgboost_0.6-4 Matrix_1.2-10 data.table_1.10.4 dplyr_0.7.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 lattice_0.20-35 assertthat_0.2.0 grid_3.4.1
[5] R6_2.2.2 magrittr_1.5 stringi_1.1.5 rlang_0.1.1
[9] bindrcpp_0.2 tools_3.4.1 glue_1.1.1 compiler_3.4.1
[13] pkgconfig_2.0.1 bindr_0.1 tibble_1.3.3