The correct ratio of positive and negative training examples for training a random-forest-based binary classifier

I realize that a related question, about the ratio of positive/negative examples in the training set, suggested that a 1:1 ratio of positive to negative training examples is favorable for the Rocchio algorithm.

However, this question differs from that related question in that it concerns a random forest model, and also in the following two ways.

1) I have plenty of training data to work with, and the main bottleneck on using more training examples is training iteration time. That is, I would prefer that training one ranker take no longer than an overnight run, because I want to iterate quickly.

2) In practice, the classifier is likely to see 1 positive example for every 4 negative examples.

In this situation, should I train with more negative examples than positive ones, or with equal numbers of positive and negative examples?

2 answers

See the section titled "Balancing prediction error" in Breiman's official random forest documentation: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance

The relevant passage is quoted in full at the end of this answer.

Overall, this seems to suggest that your training and test data should either

  • reflect the 1:4 class ratio that your real-life data will have, or
  • use a 1:1 mixture, but then carefully adjust the class weights, as demonstrated in the quoted documentation below, until the OOB error rate on your desired (smaller) class is lowered.
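To make the two options concrete, here is a minimal sketch of the data preparation for each (my own, not from the linked documentation), assuming a NumPy 0/1 label vector y:

```python
import numpy as np

rng = np.random.default_rng(0)

def natural_ratio_indices(y, n_total):
    """Option 1: a uniform subsample, which preserves the natural
    (~1:4) class ratio in expectation."""
    return rng.choice(y.size, size=n_total, replace=False)

def balanced_indices(y, n_per_class):
    """Option 2: a 1:1 mixture built by downsampling each class."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    idx = np.concatenate([
        rng.choice(pos, size=min(n_per_class, pos.size), replace=False),
        rng.choice(neg, size=min(n_per_class, neg.size), replace=False),
    ])
    rng.shuffle(idx)
    return idx
```

If you go the 1:1 route, pair it with class weights tuned as in the quoted documentation; in either case, keep the test set at the real-life 1:4 mix.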

Hope this helps.

In some data sets, the prediction error between classes is highly unbalanced: some classes have a low prediction error, others a high one. This usually occurs when one class is much larger than another. Random forests, trying to minimize the overall error rate, will then keep the error rate low on the large class while letting the smaller classes have a higher error rate. For instance, in drug discovery, where a given molecule is classified as active or not, it is common for the actives to be outnumbered by 10 to 1, up to 100 to 1. In these situations the error rate on the interesting class (the actives) will be very high.

The user can detect the imbalance from the error rates reported for the individual classes. To illustrate, 20-dimensional synthetic data is used: class 1 occurs in one spherical Gaussian, class 2 in another. A training set of 1000 class 1 cases and 50 class 2 cases is generated, together with a test set of 5000 class 1 cases and 250 class 2 cases.

The final output of a forest of 500 trees on these data (number of trees, overall test error %, class 1 error %, class 2 error %) is:

500 3.7 0.0 78.4

There is a low overall test set error (3.73%), but class 2 has more than 3/4 of its cases misclassified.

The error rates can be balanced by setting different weights for the classes.

The higher the weight given to a class, the more its error rate is decreased. A guide as to what weights to give is to make them inversely proportional to the class populations. So the weights are set to 1 on class 1 and 20 on class 2, and the run is repeated. The output:

500 12.1 12.7 0.0

The weight of 20 on class 2 is too high. Set it to 10 and try again, getting:

500 4.3 4.2 5.2

This is pretty close to balance. If exact balance is wanted, the weight on class 2 could be jiggled around a bit more.

Note that in achieving this balance, the overall error rate went up. This is the usual result: to get better balance, the overall error rate will be increased.
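For what it's worth, here is a rough scikit-learn analogue of that weight-tuning loop (my own sketch; Breiman's page describes his own implementation, and the synthetic data below is only in the spirit of his example, not a reproduction of it):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Two spherical Gaussians in 20 dimensions, in the spirit of the quoted
# example: 1000 majority (class 0) vs. 50 minority (class 1) cases.
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 20)),
               rng.normal(0.5, 1.0, size=(50, 20))])
y = np.r_[np.zeros(1000, dtype=int), np.ones(50, dtype=int)]

for w in (1, 20, 10):  # hand-tune the minority-class weight, as in the quote
    rf = RandomForestClassifier(
        n_estimators=500,            # enough trees that every case is OOB somewhere
        class_weight={0: 1, 1: w},
        oob_score=True,
        random_state=0,
    ).fit(X, y)
    oob_pred = rf.oob_decision_function_.argmax(axis=1)  # out-of-bag votes
    errs = [(oob_pred[y == c] != c).mean() for c in (0, 1)]
    print(f"weight={w:2d}  OOB error: class 0 {errs[0]:.1%}, class 1 {errs[1]:.1%}")
```

scikit-learn also accepts class_weight="balanced" (and "balanced_subsample"), which applies the inversely-proportional-to-class-populations heuristic automatically.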


This may seem like a trivial answer, but the best thing I can suggest is to try a small subset of your data (small enough that the algorithm trains quickly) and observe what your accuracy is as you use 1:1, 1:2, 1:3, etc.

Plot the results as you gradually increase the total number of examples at each ratio and see how performance responds. Very often you will find that a fraction of the data comes very close to the performance of training on the full data set, in which case you can make an informed decision about your question.
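A minimal sketch of that experiment, assuming scikit-learn and placeholder names X, y for your full labelled pool and X_val, y_val for a held-out set that keeps your real-life 1:4 mix:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def ratio_sweep(X, y, X_val, y_val,
                ratios=(1, 2, 3, 4), pos_counts=(500, 1000, 2000), seed=0):
    """Train on subsets with a 1:r positive/negative ratio and growing size,
    then score each on a held-out set with the deployment class mix."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)  # assumes the pool has enough negatives
    for r in ratios:
        for n_pos in pos_counts:
            idx = np.concatenate([
                rng.choice(pos, size=n_pos, replace=False),
                rng.choice(neg, size=n_pos * r, replace=False),
            ])
            rf = RandomForestClassifier(
                n_estimators=100, n_jobs=-1, random_state=seed,
            ).fit(X[idx], y[idx])
            print(f"1:{r}  n_pos={n_pos:5d}  "
                  f"F1={f1_score(y_val, rf.predict(X_val)):.3f}")
```

F1 is used here rather than raw accuracy, since with a 1:4 mix a model can score well on accuracy while missing most positives; substitute whatever metric you actually care about.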

Hope this helps.


Source: https://habr.com/ru/post/950448/

