I have highlighted some parts in bold.
Hope this helps.
In some datasets, the prediction error is highly unbalanced between classes: some classes have low prediction error, others high. This usually happens when one class is much larger than another. Random forests, trying to minimize the overall error rate, then keep the error rate low on the large class while letting the smaller classes have a higher error rate. For example, in drug discovery, where a molecule is classified as active or not, the actives are typically outnumbered by 10 to 1, and sometimes by up to 100 to 1. In these situations the error rate on the interesting class (the actives) will be very high.
The user can detect the imbalance by looking at the error rates reported for the individual classes. To illustrate, 20-dimensional synthetic data is used. Class 1 is drawn from one spherical Gaussian, class 2 from another. A training set of 1000 class-1 cases and 50 class-2 cases is generated, together with a test set of 5000 class-1 cases and 250 class-2 cases.
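The source does not give the means of the two Gaussians, so the separation below is an assumption; a minimal NumPy sketch of generating this kind of unbalanced train/test split might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
SHIFT = 2.0  # assumed separation between the class means (not given in the source)

def make_data(n1, n2):
    # class 1: spherical Gaussian at the origin; class 2: shifted along the first axis
    x1 = rng.normal(0.0, 1.0, size=(n1, 20))
    x2 = rng.normal(0.0, 1.0, size=(n2, 20))
    x2[:, 0] += SHIFT
    X = np.vstack([x1, x2])
    y = np.concatenate([np.ones(n1, int), np.full(n2, 2)])
    return X, y

X_train, y_train = make_data(1000, 50)  # 1000 class-1 cases, 50 class-2 cases
X_test, y_test = make_data(5000, 250)   # 5000 class-1 cases, 250 class-2 cases
```

Any classifier trained on `X_train, y_train` will see class 1 twenty times as often as class 2, which is what produces the unbalanced error rates discussed next.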
The final output of a forest of 500 trees on these data is:
500 3.7 0.0 78.4
There is a low overall test-set error (3.73%), but class 2 has more than 3/4 of its cases misclassified. (The columns are the number of trees, the overall test-set error in percent, and the error rates for class 1 and class 2.)
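A per-class error report like the one above is easy to compute from any classifier's predictions; this small helper (the function and variable names are mine, not from the original code) shows the idea:

```python
import numpy as np

def per_class_error(y_true, y_pred):
    """Return the overall error rate and the error rate within each class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    overall = float(np.mean(y_true != y_pred))
    per_class = {int(c): float(np.mean(y_pred[y_true == c] != c))
                 for c in np.unique(y_true)}
    return overall, per_class

# toy example: class 1 is never missed, class 2 is missed half the time
overall, rates = per_class_error([1, 1, 1, 1, 2, 2], [1, 1, 1, 1, 1, 2])
print(overall)  # → 0.1666... (1 error out of 6)
print(rates)    # → {1: 0.0, 2: 0.5}
```

A low overall rate alongside a high per-class rate, as in the toy example, is exactly the pattern that signals an imbalance worth correcting.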
The error rates can be balanced by setting different weights for the classes.
The higher the weight given to a class, the more its error rate is decreased. A guide as to what weights to give is to make them inversely proportional to the class populations. So set the weights to 1 on class 1 and 20 on class 2, and run again. The output is:
500 12.1 12.7 0.0
The weight of 20 on class 2 is too high. Set it to 10 and try again, getting:
500 4.3 4.2 5.2
This is pretty close to balance. If exact balance is required, the weight on class 2 could be adjusted a bit more.
Note that in achieving this balance, the overall error rate went up. This is the usual result: to get better balance between classes, the overall error rate has to increase.
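Breiman's implementation applies the class weights inside tree growing as well as in the voting; as a rough sketch of the voting side only (the function and parameter names are mine), up-weighting a class's votes shifts predictions toward it, lowering that class's error rate at the cost of the others:

```python
import numpy as np

def weighted_vote(tree_votes, class_weights):
    # tree_votes: (n_trees, n_samples) array of per-tree predicted labels
    # class_weights: dict mapping class label -> weight
    classes = np.array(sorted(class_weights))
    # raw vote counts per class for each sample
    counts = np.stack([(tree_votes == c).sum(axis=0) for c in classes])
    weights = np.array([class_weights[c] for c in classes])[:, None]
    # the winning class maximizes weight * votes
    return classes[np.argmax(counts * weights, axis=0)]

votes = np.array([[1, 1, 2],
                  [1, 2, 2],
                  [1, 1, 1]])                 # 3 trees, 3 samples
print(weighted_vote(votes, {1: 1, 2: 1}))    # → [1 1 2], plain majority vote
print(weighted_vote(votes, {1: 1, 2: 10}))   # → [1 2 2], rare class up-weighted
```

In the example, up-weighting class 2 flips the second sample from class 1 to class 2, which is the mechanism behind the trade-off described above: class 2's error rate falls while some class-1 cases become misclassified.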