I apply the random forest algorithm to the same pseudo-sample dataset (1000 observations, a 0/1 binary dependent variable, 10 numeric explanatory variables) in three different programming languages: Matlab, Python, and R.
I also try to ensure that all model parameters are the same across the three implementations (number of trees, bootstrap sample of the full dataset, number of variables randomly selected as candidates at each split, and the criterion for measuring split quality).
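Since the pseudo-sample itself is not reproduced here, this is an R sketch of a dataset with the same shape (the seed, distributions, and class balance are arbitrary and purely illustrative, not my actual data):
set.seed(42)
X <- matrix(rnorm(1000 * 10), nrow = 1000)   # 10 numeric explanatory variables
colnames(X) <- paste0("x", 1:10)
Y <- rbinom(1000, size = 1, prob = 0.5)      # 0/1 binary dependent variable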
While Matlab and Python give essentially the same results (i.e. predicted probabilities), the results from R are very different.
What could be the reason for the difference between the results obtained with R on the one hand and with Matlab and Python on the other?
I assume that there is some default model parameter that differs in R, one that I am not aware of or that is hardcoded in the randomForest package.
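One check I plan to run is to list every argument of randomForest together with its default value and compare these against the scikit-learn and TreeBagger settings; for instance, nodesize defaults to 1 for classification forests but 5 for regression forests. A minimal R sketch of that check:
library(randomForest)
# all arguments of the default method and their default values
# (nodesize, maxnodes, cutoff, classwt, sampsize, ...)
formals(getS3method("randomForest", "default"))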
The exact code I ran is as follows:
Matlab:
% 1000 trees, full bootstrap sample, 4 predictors per split, minimum leaf size 1, Gini split criterion
b = TreeBagger(1000, X, Y, 'FBoot', 1, 'NVarToSample', 4, 'MinLeaf', 1, 'Method', 'classification', 'SplitCriterion', 'gdi');
% class scores (probabilities) for the training observations
[~, scores] = predict(b, X);
Python:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# 1000 trees, 4 features per split, bootstrap sampling; then in-sample class probabilities
clf = RandomForestClassifier(n_estimators=1000, max_features=4, bootstrap=True)
clf.fit(X, Y)
scores = pd.DataFrame(clf.predict_proba(X))
R:
library(randomForest)
# 1000 trees, bootstrap sample of size n drawn with replacement, 4 variables tried at each split
results.rf <- randomForest(X, Y, ntree = 1000, type = "classification", sampsize = length(Y), replace = TRUE, mtry = 4)
# note: with no newdata argument, predict.randomForest returns the out-of-bag predictions
scores <- predict(results.rf, type = "prob",
                  norm.votes = FALSE, predict.all = FALSE, proximity = FALSE, nodes = FALSE)
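For completeness, here is a quick diagnostic sketch (not part of the fit above) that I can run on the R object to see whether a hidden default or the input handling explains the gap:
results.rf$type   # did randomForest build a "classification" or a "regression" forest?
class(Y)          # randomForest infers the forest type from the class of Y (factor vs. numeric)
results.rf        # prints ntree, mtry, and the out-of-bag error estimate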