I trained a simple model with the xgboost library in R on a matrix created by sparse.model.matrix. I then made predictions on two versions of the same validation data: one built with sparse.model.matrix from Matrix, the other with model.matrix from stats. To my great surprise, the results differ. The sparse and dense matrices have the same dimensions, all data are numeric, and there are no missing values.
The mean prediction for each of the two validation sets is:
- dense validation matrix: 0.5009256
- sparse validation matrix: 0.4988821
Is this expected behavior or a bug?
Update:
I noticed that the discrepancy does not occur when the values are either all positive or all negative. If the variable x1 is defined as x1=sample(1:7, 2000, replace=TRUE), the mean prediction is the same in both cases.
Code in R:
library(Matrix)
library(xgboost)
# Simulated data: binary target y, integer x1 in -1..5, uniform x2
valid <- data.frame(y=sample(0:1, 2000, replace=TRUE), x1=sample(-1:5, 2000, replace=TRUE), x2=runif(2000))
train <- data.frame(y=sample(0:1, 10000, replace=TRUE), x1=sample(-1:5, 10000, replace=TRUE), x2=runif(10000))
# Sparse training and validation matrices
sparse_train_matrix <- sparse.model.matrix(~ ., data=train[, c("x1", "x2")])
d_sparse_train_matrix <- xgb.DMatrix(sparse_train_matrix, label = train$y)
sparse_valid_matrix <- sparse.model.matrix(~ ., data=valid[, c("x1", "x2")])
d_sparse_valid_matrix <- xgb.DMatrix(sparse_valid_matrix, label = valid$y)
# Dense version of the same validation data
valid_matrix <- model.matrix(~ ., data=valid[, c("x1", "x2")])
d_valid_matrix <- xgb.DMatrix(valid_matrix, label = valid$y)
params <- list(objective = "binary:logistic", seed = 99, eval_metric = "auc")
sparse_w <- list(train=d_sparse_train_matrix, test=d_sparse_valid_matrix)
set.seed(1)
sparse_fit_xgb <- xgb.train(data=d_sparse_train_matrix, watchlist=sparse_w, params=params, nrounds=100)
# Predict with the same model on the dense (p1) and sparse (p2) validation matrices;
# binary:logistic already returns probabilities, so no type argument is needed
p1 <- predict(sparse_fit_xgb, newdata=d_valid_matrix)
p2 <- predict(sparse_fit_xgb, newdata=d_sparse_valid_matrix)
mean(p1); mean(p2)
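One check that might help narrow this down, given the observation in the update: x1 drawn from -1:5 can be exactly 0, while x1 drawn from 1:7 cannot. The sketch below uses only Matrix. It confirms that the two matrix types hold the same values and counts how many entries are exact zeros, which a sparse matrix normally does not store explicitly. The data frame check and the other variable names are hypothetical stand-ins, not part of the code above. Whether xgboost treats such entries differently from explicit zeros in a dense matrix is an assumption to verify, not something this check proves.

```r
library(Matrix)

set.seed(1)  # arbitrary seed, only to make this sketch repeatable

# Hypothetical stand-in for the validation predictors (same definitions as above)
check <- data.frame(x1 = sample(-1:5, 2000, replace = TRUE), x2 = runif(2000))

dense_check  <- model.matrix(~ ., data = check)
sparse_check <- sparse.model.matrix(~ ., data = check)

# Element-by-element comparison: both matrices should hold identical values
values_match <- all(as.matrix(sparse_check) == dense_check)

# Exact zeros in the dense matrix
n_dense_zeros <- sum(dense_check == 0)

# Zero-valued cells in the sparse matrix: total cells minus non-zero entries
n_sparse_zeros <- prod(dim(sparse_check)) - nnzero(sparse_check)

print(values_match)
print(c(dense_zeros = n_dense_zeros, sparse_zeros = n_sparse_zeros))
```

If the zero count is positive here but drops to 0 when x1 is drawn from 1:7, that would at least be consistent with the discrepancy being tied to the zero entries.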
My session:
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
[5] LC_TIME=Polish_Poland.1250
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xgboost_0.6-4 Matrix_1.2-10 data.table_1.10.4 dplyr_0.7.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 lattice_0.20-35 assertthat_0.2.0 grid_3.4.1
[5] R6_2.2.2 magrittr_1.5 stringi_1.1.5 rlang_0.1.1
[9] bindrcpp_0.2 tools_3.4.1 glue_1.1.1 compiler_3.4.1
[13] pkgconfig_2.0.1 bindr_0.1 tibble_1.3.3