R random forests: variable importance values

I am trying to use a random forest package for classification in R.

The variable importance values are listed below:

  • mean raw importance score of variable x for class 0
  • mean raw importance score of variable x for class 1
  • MeanDecreaseAccuracy
  • MeanDecreaseGini
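For reference, these columns come from output like the following (a minimal sketch, assuming the randomForest package; iris and the specific settings are just for illustration):

```r
library(randomForest)  # install.packages("randomForest") if needed

set.seed(42)
# importance = TRUE is required to get MeanDecreaseAccuracy in the output
rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

# One raw importance column per class, plus the two summary columns
# MeanDecreaseAccuracy and MeanDecreaseGini:
importance(rf)
```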

Now I know what they mean, since I know their definitions. I want to know how to use them.

What I really want to know is what these values mean in practical terms: what counts as a good value, what counts as a bad value, what the maximum and minimum values are, and so on.

If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini, is it important or unimportant? Any information on the raw scores could also be helpful. I want to know everything there is to know about these numbers that is relevant to applying them.

An explanation that uses the words error, summation, or permutation would be less useful than a simpler explanation that involves no discussion of how random forests work.

It is as if I asked someone to explain how to use a radio: I would not expect the explanation to cover how the radio converts radio waves into sound.

+41
r statistics data-mining random-forest
Apr 10 '09 at 2:18
3 answers

An explanation that uses the words error, summation, or permutation would be less helpful than a simpler explanation that does not discuss how random forests work.

It is as if I asked someone to explain how to use a radio: I would not expect the explanation to cover how the radio converts radio waves into sound.

How would you explain what the numbers in "WKRP 100.5 FM" mean without going into the technical details of wave frequencies? Honestly, the parameters and related performance issues with random forests are hard to reason about even if you understand some of the technical terms.

Here is my attempt at some answers:

-mean raw importance score of variable x for class 0

-mean raw importance score of variable x for class 1

Simplifying from the random forest web page: the raw importance score estimates how much more helpful than random a particular predictor variable is in successfully classifying the data.

-MeanDecreaseAccuracy

I think this measure is specific to the R module, and I believe it measures how much including this predictor in the model reduces classification error.

-MeanDecreaseGini

Gini is defined as "inequality" when used to describe a society's distribution of income, or as a measure of "node impurity" in tree-based classification. A low Gini (i.e., a higher decrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It is hard to describe without talking about the fact that data in classification trees are split at individual nodes based on the values of predictors. I am not so clear on how this translates into better performance.

+23
May 08 '09 at

For your immediate concern: higher values mean the variable is more important. This should be true for all of the measures you list.

Random forests give you quite complex models, so interpreting the importance measures can be difficult. If you want to easily understand what your variables are doing, don't use RF. Instead, use linear models or a basic decision tree (not an ensemble).
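As a concrete illustration of "higher means more important" (a sketch, assuming the randomForest package; iris is just stand-in data), you can rank predictors by any of these columns:

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp <- importance(rf)
# Rank predictors from most to least important by MeanDecreaseGini:
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]

varImpPlot(rf)  # the same rankings as a plot
```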

You said:

An explanation that uses the words error, summation, or permutation would be less useful than a simpler explanation that did not include any discussion of how random forests work.

It would be awfully difficult to explain much more than the above without digging in and learning what random forests are. I assume you are complaining about the manual, or about this section of Breiman's manual:

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp

To find out how important a variable is, they fill it with random junk ("permute" it) and then see how much predictive accuracy decreases. MeanDecreaseAccuracy and MeanDecreaseGini work this way. I am not sure what the raw importance scores are.
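The permutation idea can be sketched by hand (a toy, in-sample version only; the package actually uses out-of-bag samples, and the variable chosen here is arbitrary):

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris)

# Accuracy with the data intact:
baseline <- mean(predict(rf, iris) == iris$Species)

# Destroy one predictor's signal by shuffling its values:
shuffled <- iris
shuffled$Petal.Width <- sample(shuffled$Petal.Width)
permuted <- mean(predict(rf, shuffled) == iris$Species)

# A larger accuracy drop means a more important variable:
baseline - permuted
```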

+20
Jul 22 '09 at 6:54

Interpretation is rather tricky with random forests. While RF is an extremely robust classifier, it makes its predictions democratically. By this I mean that you build hundreds or thousands of trees, each time taking a random subset of your variables and a random subset of your data and growing a tree. You then make a prediction on all the data that was not selected and save the prediction. It is robust because it copes well with the vagaries of your data set (i.e., it smooths out randomly high/low values, chance patterns, four variables measuring the same thing, etc.). However, if you have some highly correlated variables, both may appear important, because they are not both included in every model.

One potential approach is to use random forests to help you winnow down your predictors, and then switch to regular CART or try the party package for inference-based tree models. Then, however, you should be wary of data-mining issues and of drawing inferences about the parameters.
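That winnowing workflow might look something like this (a sketch, assuming the randomForest and rpart packages; keeping the top 2 variables is an arbitrary choice for illustration):

```r
library(randomForest)
library(rpart)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Keep only the strongest predictors by MeanDecreaseGini ...
imp <- importance(rf)[, "MeanDecreaseGini"]
top <- names(sort(imp, decreasing = TRUE))[1:2]

# ... then fit a single, interpretable CART tree on just those variables.
fit <- rpart(reformulate(top, response = "Species"), data = iris)
print(fit)
```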

+5
Jul 28 '09 at 5:55


