I am using scikit-learn's Random Forest:
sklearn.ensemble.RandomForestClassifier(n_estimators=100, max_features="auto", max_depth=10)
After calling rf.fit(...), the process memory usage increases by 80 MB, i.e. 0.8 MB per tree. (I also tried many other settings with similar results. I used top and psutil to monitor the memory usage.)
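For reference, a sketch of the psutil measurement (the synthetic dataset is just a stand-in for my data; note that recent scikit-learn versions spell the old max_features="auto" classifier default as "sqrt"):

    import os

    import psutil
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", max_depth=10)

    proc = psutil.Process(os.getpid())
    rss_before = proc.memory_info().rss  # resident set size, in bytes
    rf.fit(X, y)
    rss_after = proc.memory_info().rss

    print(f"RSS grew by {(rss_after - rss_before) / 2**20:.1f} MB during fit()")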
A binary tree of depth 10 contains at most 2^11 - 1 = 2047 nodes, which can be stored in a single dense array, making it easy for the programmer to find the parent and children of any node (see the sketch below).
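For concreteness, this is the usual implicit array layout I have in mind (a sketch of the idea, not scikit-learn's actual storage scheme):

    # Implicit array layout for a complete binary tree:
    # node i has children at 2*i + 1 and 2*i + 2, and its parent at (i - 1) // 2.
    def children(i: int) -> tuple[int, int]:
        return 2 * i + 1, 2 * i + 2

    def parent(i: int) -> int:
        return (i - 1) // 2

    print(children(0))  # (1, 2)
    print(parent(2))    # 0
    print(2**11 - 1)    # 2047 slots suffice for depth 10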
Each node needs the index of the feature used for the split and the cut-off threshold, i.e. 6-16 bytes depending on how economical the programmer is. In my case, this means 0.01-0.03 MB per tree.
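To put numbers on this, here is a sketch that prints both my estimate and what a fitted scikit-learn tree actually stores per node (the tree_ array attributes below are public scikit-learn API; the synthetic data is again a stand-in):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Back-of-envelope: ~2047 nodes per tree at 6-16 bytes per node.
    nodes = 2**11 - 1
    print(f"estimate: {nodes * 6 / 2**20:.3f}-{nodes * 16 / 2**20:.3f} MB per tree")

    # Inspect the per-node arrays a fitted tree actually holds.
    X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
    rf = RandomForestClassifier(n_estimators=1, max_depth=10, random_state=0).fit(X, y)
    t = rf.estimators_[0].tree_
    print("node_count:", t.node_count)
    total = 0
    for name in ("children_left", "children_right", "feature", "threshold",
                 "impurity", "n_node_samples", "weighted_n_node_samples", "value"):
        arr = getattr(t, name)
        total += arr.nbytes
        print(f"{name:26s} dtype={arr.dtype} nbytes={arr.nbytes}")
    print("total bytes in these arrays:", total)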
Why does the scikit-learn implementation use 20-60 times that much memory to store a random forest tree?