Proximity Matrix - Random Forest, R

Question


I use the randomForest package in R, which lets me compute the proximity matrix (P). The package documentation describes this value as: "if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes)."

I get the proximity matrix of a random forest as follows:

P <- randomForest(x, y, ntree = 1000, proximity=TRUE)$proximity

When I examine the matrix P, I see values such as P(i, j) = 0.971014493, where i and j are two data instances in my training dataset (x). This value does not make sense to me: multiplied by 1000 (the number of trees in the forest), it does not give an integer, which is what I would expect from a "frequency". Can someone please help me understand why I get such real numbers in the proximity matrix?

+6

r statistics random-forest proximity

banbar May 20, '14 at 14:00


3 answers

Just to add to the above answer, since this looked strange to me too, and in case it helps anyone: according to Breiman (and I quote):

"Intrinsic proximity measure.

Since an individual tree is unpruned, the terminal nodes will contain only a small number of instances. Run all cases in the training set down the tree. If case i and case j both land in the same terminal node, increase the proximity between i and j by one. At the end of the run, the proximities are divided by twice the number of trees in the run and proximity between a case and itself set equal to one."

The above is from Breiman's paper “Using Random Forests,” which is referenced in the documentation of the randomForest function.

+5

LyzandeR Oct 28 '14 at 15:15


Proximity is the proportion of trees in which two data points end up in the same leaf node.
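As a rough sketch of that definition, the proportion can be recomputed by hand from the terminal nodes each case reaches. The iris data, the seed, and the helper prox_by_hand are illustrative assumptions, not part of the question. The sketch sets oob.prox = FALSE so that every tree is counted, and the two numbers should then agree.

```r
library(randomForest)

set.seed(1)
x <- iris[, 1:4]        # illustrative data, not the asker's
y <- iris$Species
ntree <- 200

# Count proximities over every tree rather than only out-of-bag pairs.
rf <- randomForest(x, y, ntree = ntree, proximity = TRUE, oob.prox = FALSE)

# Terminal node reached by each case in each tree (an n x ntree matrix).
nodes <- attr(predict(rf, newdata = x, nodes = TRUE), "nodes")

# Hypothetical helper: share of trees in which cases i and j
# land in the same terminal node.
prox_by_hand <- function(i, j) mean(nodes[i, ] == nodes[j, ])

c(by_hand = prox_by_hand(1, 2), package = rf$proximity[1, 2])
```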

+4

Waqar detho Feb 19 '15 at 11:45


joran · Accepted Answer · 2014-05-20T15:10:08+0000

By default (as with predictions), proximity is calculated using only the trees for which neither case was included in the sample used to build that tree (that is, both cases were "out of bag").

The number of trees for which this happens varies slightly from one pair of cases to another, and it certainly won't be a nice round number like 1000.
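To make the arithmetic concrete, here is a small base R sketch. The counts 134 and 138 and the training-set size n are illustrative assumptions chosen to reproduce the value in the question; they are not read from the asker's forest.

```r
# The value reported in the question.
p <- 0.971014493

# Scaling by the total number of trees does not give a whole number.
p * 1000

# A ratio over a smaller denominator reproduces it: for example, a pair
# that shared a leaf in 134 of the 138 trees where both were out of bag.
134 / 138

# Rough expected number of trees in which two given cases are both
# out of bag, for bootstrap samples of size n drawn with replacement.
n <- 150                              # hypothetical training-set size
ntree <- 1000
expected_joint_oob <- ntree * (1 - 2 / n)^n
expected_joint_oob
```

The denominator is therefore a pair-specific count of roughly 13 to 14 percent of the trees, which is why the entries look like arbitrary fractions.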

You will notice that the next argument listed after proximity is called oob.prox, which controls whether only out-of-bag pairs are used (the default) or every tree.
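A minimal sketch of the difference, assuming the iris data as a stand-in for x and y and a hypothetical helper is_whole: with oob.prox = FALSE the proximities multiplied by ntree should come out as whole numbers, whereas with the default they generally will not.

```r
library(randomForest)

set.seed(42)
x <- iris[, 1:4]        # illustrative data, not the asker's
y <- iris$Species
ntree <- 1000

# Default: only trees where both cases are out of bag are counted.
P_oob <- randomForest(x, y, ntree = ntree, proximity = TRUE)$proximity

# Every tree is counted, so each entry is (number of shared leaves) / ntree.
P_all <- randomForest(x, y, ntree = ntree, proximity = TRUE,
                      oob.prox = FALSE)$proximity

# Hypothetical helper: are all entries whole numbers, up to rounding error?
is_whole <- function(m) all(abs(m - round(m)) < 1e-6)

is_whole(P_all * ntree)
is_whole(P_oob * ntree)
```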
