Why does the Sklearn RandomForest model take up a lot of disk space after saving?

I save the RandomForestClassifier model from the sklearn library with the code below

 with open('/tmp/rf.model', 'wb') as f:
     cPickle.dump(RF_model, f)

This takes up a lot of space on my hard drive. There are only 50 trees in the model, yet on disk it occupies more than 50 MB (the analyzed data set is ~20 MB, with 21 features). Does anyone have any idea why? I observe similar behavior for ExtraTreesClassifier.
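Most of that space presumably goes into the per-node arrays that every fitted tree carries, and each tree can easily have thousands of nodes. A minimal sketch (using the fitted RF_model from above) to count how many nodes the forest actually stores:

 # Each fitted tree exposes its node count via the underlying tree_ object.
 n_trees = len(RF_model.estimators_)
 total_nodes = sum(est.tree_.node_count for est in RF_model.estimators_)
 print("trees: %d, total nodes: %d" % (n_trees, total_nodes))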

Edit: RF parameters:

 "n_estimators": 50, "max_features": 0.2, "min_samples_split": 20, "criterion": "gini", "min_samples_leaf": 11 

As suggested by @dooms, I checked sys.getsizeof and it returns 64 - I assume this is just the size of the pointer.
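sys.getsizeof only reports the size of the outer Python object, not the numpy arrays held by the trees, so a more telling number (a sketch, using the same cPickle as above) is the length of the pickled byte string itself:

 import cPickle

 serialized = cPickle.dumps(RF_model, cPickle.HIGHEST_PROTOCOL)
 print(len(serialized))  # bytes the pickle actually occupies, not just the object header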

I tried another way to save the model:

 from sklearn.externals import joblib

 joblib.dump(RF_model, 'filename.pkl')

Using this method, I get one .pkl file and 201 .npy files with a total size of 14.9 MB, so less than the previous 53 MB. These 201 .npy files follow a pattern: there are 4 files per tree in the forest.

The contents of the first file (231 KB):

 array([(1, 1062, 20, 0.2557438611984253, 0.4997574055554296, 29168, 46216.0),
        (2, 581, 12, 0.5557271242141724, 0.49938159451291675, 7506, 11971.0),
        (3, 6, 14, 0.006186043843626976, 0.4953095968671224, 4060, 6422.0),
        ...,
        (4123, 4124, 15, 0.6142271757125854, 0.4152249134948097, 31, 51.0),
        (-1, -1, -2, -2.0, 0.495, 11, 20.0),
        (-1, -1, -2, -2.0, 0.3121748178980229, 20, 31.0)],
       dtype=[('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'),
              ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'),
              ('weighted_n_node_samples', '<f8')])

The contents of the second file (66 KB):

 array([[[ 2.25990000e+04,  2.36170000e+04]],
        [[ 6.19600000e+03,  5.77500000e+03]],
        [[ 3.52200000e+03,  2.90000000e+03]],
        ...,
        [[ 3.60000000e+01,  1.50000000e+01]],
        [[ 1.10000000e+01,  9.00000000e+00]],
        [[ 2.50000000e+01,  6.00000000e+00]]])

Third file (88 B):

 array([2]) 

Last file from the group (96 B):

 array([ 0., 1.]) 
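These look as if they could be the node table, the per-node class counts, the number of classes, and the class labels: the field names in the first file match the attributes a fitted tree exposes. A sketch (again using RF_model from above) to compare the dumped arrays against one tree:

 tree = RF_model.estimators_[0].tree_
 print(tree.children_left[:3])   # 'left_child' column of the node table in the first file
 print(tree.children_right[:3])  # 'right_child' column
 print(tree.threshold[:3])       # 'threshold' column
 print(tree.value.shape)         # per-node class counts, shaped like the second file
 print(RF_model.n_classes_)      # 2, which would match the array([2]) in the third file
 print(RF_model.classes_)        # class labels, which would match array([ 0., 1.])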

Does anyone know for sure what these are? I tried to look into the tree code in sklearn, but it is hard to follow. And any ideas how to save a sklearn forest so that it takes up less disk space? (For comparison, a similar xgboost ensemble takes ~200 KB in total.)
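One option that might help with the size (a sketch only, not benchmarked on this model) is joblib's compress parameter, which writes a single compressed file instead of 201 separate ones:

 from sklearn.externals import joblib

 # compress takes a level from 0 to 9; higher means smaller files but slower dump/load
 joblib.dump(RF_model, 'filename.pkl.z', compress=3)
 RF_model_loaded = joblib.load('filename.pkl.z')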

1 answer

I have seen the same behavior using pickle dumps. The dump is about 10 times the in-memory size of the model.

 from sys import getsizeof

 memory_size = getsizeof(RF_model)

Check whether there is a huge difference, and if so, look for another way to save your model.
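For example, a gzip-compressed pickle is a near drop-in replacement for the plain cPickle dump (a sketch; the actual savings will depend on the model):

 import gzip
 import cPickle

 with gzip.open('/tmp/rf.model.gz', 'wb') as f:
     cPickle.dump(RF_model, f, cPickle.HIGHEST_PROTOCOL)

 with gzip.open('/tmp/rf.model.gz', 'rb') as f:
     RF_model_restored = cPickle.load(f)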

