Proximity Matrix - Random Forest, R

I use the randomForest package in R, which lets me compute the proximity matrix (P). The package documentation describes it as follows: "if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes)."

I get the proximity matrix of a random forest as follows:

library(randomForest)
P <- randomForest(x, y, ntree = 1000, proximity = TRUE)$proximity

When I examine the matrix P, I see values like P(i, j) = 0.971014493, where i and j are two data instances in my training dataset (x). This value does not make sense to me: when it is multiplied by 1000 (the number of trees in the forest), the result is not an integer, so it cannot be a "frequency". Can someone please help me understand why I get such real numbers in the proximity matrix?

+6
3 answers

As with the default predictions, the default proximities are computed using only the trees in which neither of the two cases was included in the sample used to build that tree (both were "out-of-bag").

The number of trees for which this happens varies from one pair of cases to the next, and it certainly will not be a nice round number such as 1000.

You will notice that the argument listed right after proximity is called oob.prox; it controls whether only out-of-bag pairs are used (the default) or every tree.
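To make the difference concrete, here is a minimal sketch that fits the same forest with and without oob.prox. The iris data, the seed, and the variable names are assumptions chosen for illustration; they are not from the question. With oob.prox = FALSE every tree contributes to every pair, so multiplying the proximities by ntree should give whole numbers. With the default, each pair has its own denominator; a value such as 0.971014493 is consistent with, for example, 67 shared leaves out of 69 trees in which both cases were out-of-bag.

```r
library(randomForest)

set.seed(1)                     # arbitrary seed, only for reproducibility
x <- iris[, 1:4]                # illustrative data, not the asker's x
y <- iris$Species
ntree <- 1000

# Default behaviour: oob.prox = TRUE, each pair is divided by the number
# of trees in which both cases were out-of-bag.
p_oob <- randomForest(x, y, ntree = ntree, proximity = TRUE)$proximity

# Every tree contributes to every pair, so the denominator is ntree.
p_all <- randomForest(x, y, ntree = ntree, proximity = TRUE,
                      oob.prox = FALSE)$proximity

# Check whether the rescaled values are whole numbers (up to rounding error).
counts_all  <- p_all * ntree
all_integer <- isTRUE(all.equal(counts_all, round(counts_all)))
all_integer

# Same check for the default matrix, for comparison.
counts_oob <- p_oob * ntree
isTRUE(all.equal(counts_oob, round(counts_oob)))
```

If whole-number counts are what you need, passing oob.prox = FALSE is the simplest route, at the cost of proximities that include trees in which the cases were part of the training sample.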

+9

Just to add to the answer above, since this looked strange to me as well and in case it helps anyone. According to Breiman (and I quote):

"Intrinsic proximity measure.

Since an individual tree is unpruned, the terminal nodes will contain only a small number of instances. Run all cases in the training set down the tree. If case i and case j both land in the same terminal node, increase the proximity between i and j by one. At the end of the run, the proximities are divided by twice the number of trees in the run and proximity between a case and itself set equal to one."

The passage above comes from Breiman's "Manual On Setting Up, Using, And Understanding Random Forests", which is listed as a reference in the documentation of the randomForest function.
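The procedure in the quote can be sketched by hand from the terminal-node assignments that predict() returns when called with nodes = TRUE. The iris data and the names below are illustrative assumptions. Note that this sketch divides by the number of trees rather than twice that number, because that is what matches the matrix the package returns with oob.prox = FALSE.

```r
library(randomForest)

set.seed(42)                    # arbitrary seed, only for reproducibility
x <- iris[, 1:4]                # illustrative data
y <- iris$Species

rf <- randomForest(x, y, ntree = 200, proximity = TRUE, oob.prox = FALSE)

# n x ntree matrix: terminal node reached by each training case in each tree.
nodes <- attr(predict(rf, x, nodes = TRUE), "nodes")

n <- nrow(nodes)
counts <- matrix(0, n, n)
for (t in seq_len(ncol(nodes))) {
  # Add one for every pair of cases that shares a terminal node in tree t.
  counts <- counts + outer(nodes[, t], nodes[, t], "==")
}

# Divide by the number of trees; the diagonal comes out as one automatically.
p_manual <- counts / ncol(nodes)

# Largest discrepancy from the matrix computed by the package.
max_diff <- max(abs(p_manual - rf$proximity))
max_diff
```

The loop is quadratic in the number of cases for every tree, so it is meant only to illustrate the definition on small data.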

+5

Proximity is the proportion of trees in which two data points end up in the same leaf node.

+4

Source: https://habr.com/ru/post/969581/
