Sequential Fit Sklearn Random Forest

I am training a random forest classifier in Python with sklearn on image data. Since I am segmenting the image, I have to store the features of every pixel, which ends up as a huge matrix, for example 100,000,000 rows of data points. When I run the RF classifier on this matrix, my computer runs out of memory and takes forever.

One idea I had was to train the classifier on consecutive small batches of the dataset, so that in the end it has been trained on everything, with each batch improving the fit. Is this an idea that could work? Will each call to fit simply undo the previous fit?

1 answer

By default, each call to fit discards the previous model, but with warm_start=True the existing trees are kept and new ones are added:

from sklearn.ensemble import RandomForestClassifier

# First build 100 trees on X1, y1
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X1, y1)

# Raise the total to 200: this fit keeps the first 100 trees
# and builds 100 additional trees on X2, y2
clf.set_params(n_estimators=200)
clf.fit(X2, y2)
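
Applied to the batch scenario from the question, this becomes a simple loop. Here is a minimal sketch, assuming the pixel data can be streamed in chunks; the random batches below are a hypothetical stand-in for data loaded from disk:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical stand-in for pixel-feature batches streamed from disk
batches = [(rng.random((1000, 5)), rng.integers(0, 2, 1000)) for _ in range(5)]

TREES_PER_BATCH = 10
clf = RandomForestClassifier(n_estimators=TREES_PER_BATCH, warm_start=True)
for i, (X_batch, y_batch) in enumerate(batches):
    # each fit keeps the already-built trees and trains only the
    # newly added ones, using just the current batch
    clf.set_params(n_estimators=(i + 1) * TREES_PER_BATCH)
    clf.fit(X_batch, y_batch)

Note that each tree only ever sees the batch it was trained on, so no single tree is fit on the full dataset.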

Alternatively, you can train separate forests and merge their trees afterwards:

from functools import reduce

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def generate_rf(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_train, y_train)
    print("rf score", rf.score(X_test, y_test))
    return rf

def combine_rfs(rf_a, rf_b):
    # append rf_b's fitted trees to rf_a and update the tree count
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# Create n random forest classifiers
rfs = [generate_rf(X_train, y_train, X_test, y_test) for i in range(n)]
# combine the classifiers into a single forest
rf_clf_combined = reduce(combine_rfs, rfs)
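
This works because a fitted RandomForestClassifier predicts by averaging over the trees in its estimators_ list, so the merged model behaves like one forest with all the trees. It does assume every sub-forest was trained on the same feature columns and saw the same set of class labels; in the batch setting, each call to generate_rf would be given its own chunk of the data instead of the single split shown here.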

Source: https://habr.com/ru/post/1663625/