Random Forest Mismatch Between R and Matlab & Python

I apply the random forest algorithm in three different programming languages to the same pseudo-sample dataset (1000 observations, a binary 0/1 dependent variable, 10 numeric explanatory variables).

I also try to ensure that all model parameters are the same in every language (number of trees, bootstrap sample the size of the entire sample, number of variables randomly selected as candidates for each split, criterion for measuring the quality of a split).

While Matlab and Python give basically the same results (i.e. probabilities), the results from R are very different.

What could explain the difference between the results from R on the one hand and those from Matlab and Python on the other?

I assume that some default model parameter differs in R, one that I am not aware of or that is hardcoded in the underlying randomForest package.

The exact code I ran is as follows:

Matlab:

 b = TreeBagger(1000,X,Y, 'FBoot',1, 'NVarToSample',4, 'MinLeaf',1, 'Method', 'classification','Splitcriterion', 'gdi')
 [~,scores,~] = predict(b,X);

Python:

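 import pandas as pd
 from sklearn.ensemble import RandomForestClassifier
 clf = RandomForestClassifier(n_estimators=1000, max_features=4, bootstrap=True)
 clf.fit(X, Y)  # fit() returns the fitted estimator itself
 scores = pd.DataFrame(clf.predict_proba(X))  # class probabilities on the training data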

R:

 results.rf <- randomForest(X,Y,  ntree=1000, type = "classification", sampsize = length(Y),replace=TRUE,mtry=4)
 scores <- predict(results.rf, type="prob",
    norm.votes=FALSE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
1 answer

When you call predict on a randomForest object in R without providing a dataset, it returns the out-of-bag (OOB) predictions. In your other methods, you pass the training data in again. I suspect that if you do the same in the R version, your probabilities will be similar:

 scores <- predict(results.rf, X, type="prob",
    norm.votes=FALSE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)

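Also note that if you want unbiased probabilities, R's approach of returning OOB predictions is the better choice when predicting on the training data.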
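
To see the same effect from the Python side, here is a minimal sketch that compares in-sample and out-of-bag probabilities from a single scikit-learn forest. The data is a synthetic stand-in, not the dataset from the question, and the seed, the 200 trees, and the error measure are arbitrary choices made for illustration. It relies on `oob_score=True`, which makes scikit-learn store the OOB class probabilities in `oob_decision_function_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the question's data (an assumption for illustration):
# 1000 rows, 10 numeric predictors, binary 0/1 response.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
Y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

clf = RandomForestClassifier(
    n_estimators=200, max_features=4, bootstrap=True,
    oob_score=True, random_state=0,
)
clf.fit(X, Y)

# In-sample: every tree votes, including the trees that were trained on the row.
p_in = clf.predict_proba(X)[:, 1]
# Out-of-bag: only the trees whose bootstrap sample left the row out contribute.
p_oob = clf.oob_decision_function_[:, 1]

# Mean absolute gap between the predicted probability of class 1 and the label.
err_in = np.mean(np.abs(p_in - Y))
err_oob = np.mean(np.abs(p_oob - Y))
print(f"mean |p - y| in-sample: {err_in:.3f}, out-of-bag: {err_oob:.3f}")
```

The in-sample probabilities should sit much closer to the 0/1 labels. Each bootstrap sample contains a given row with probability of about 0.632, and a fully grown tree reproduces the labels of the rows it was trained on, so most of the in-sample votes for a row are simply its own label. The OOB column is the counterpart of what R's predict returns when no data is passed.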

Source: https://habr.com/ru/post/1624313/