I apply the random forest algorithm to the same pseudo-sample dataset (1000 observations, a 0/1 binary dependent variable, 10 numeric explanatory variables) in three different programming languages: Matlab, Python, and R.
I also try to ensure that all model parameters are the same across the three implementations (number of trees, bootstrap sample of the full dataset, number of variables randomly selected as candidates at each split, and the criterion for measuring split quality).
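Since the pseudo-sample itself is not reproduced here, this is an R sketch of a dataset with the same shape (the seed, distributions, and class balance are arbitrary and purely illustrative, not my actual data):
set.seed(42)
X <- matrix(rnorm(1000 * 10), nrow = 1000)   # 10 numeric explanatory variables
colnames(X) <- paste0("x", 1:10)
Y <- rbinom(1000, size = 1, prob = 0.5)      # 0/1 binary dependent variable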
While Matlab and Python give essentially the same results (i.e. predicted probabilities), the results from R are very different.
What could be the reason for the difference between the results obtained with R on the one hand and with Matlab and Python on the other?
I assume that there is some default model parameter that differs in R, one that I am not aware of or that is hardcoded in the randomForest package.
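One check I plan to run is to list every argument of randomForest together with its default value and compare these against the scikit-learn and TreeBagger settings; for instance, nodesize defaults to 1 for classification forests but 5 for regression forests. A minimal R sketch of that check:
library(randomForest)
# all arguments of the default method and their default values
# (nodesize, maxnodes, cutoff, classwt, sampsize, ...)
formals(getS3method("randomForest", "default"))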
The exact code I ran is as follows:
Matlab:
% 1000 trees, full bootstrap sample, 4 predictors per split, minimum leaf size 1, Gini split criterion
b = TreeBagger(1000, X, Y, 'FBoot', 1, 'NVarToSample', 4, 'MinLeaf', 1, 'Method', 'classification', 'SplitCriterion', 'gdi');
% class scores (probabilities) for the training observations
[~, scores] = predict(b, X);
Python:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# 1000 trees, 4 features per split, bootstrap sampling; then in-sample class probabilities
clf = RandomForestClassifier(n_estimators=1000, max_features=4, bootstrap=True)
clf.fit(X, Y)
scores = pd.DataFrame(clf.predict_proba(X))
R:
library(randomForest)
# 1000 trees, bootstrap sample of size n drawn with replacement, 4 variables tried at each split
results.rf <- randomForest(X, Y, ntree = 1000, type = "classification", sampsize = length(Y), replace = TRUE, mtry = 4)
# note: with no newdata argument, predict.randomForest returns the out-of-bag predictions
scores <- predict(results.rf, type = "prob",
                  norm.votes = FALSE, predict.all = FALSE, proximity = FALSE, nodes = FALSE)
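For completeness, here is a quick diagnostic sketch (not part of the fit above) that I can run on the R object to see whether a hidden default or the input handling explains the gap:
results.rf$type   # did randomForest build a "classification" or a "regression" forest?
class(Y)          # randomForest infers the forest type from the class of Y (factor vs. numeric)
results.rf        # prints ntree, mtry, and the out-of-bag error estimate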