Classification of text documents with random forests

I have a set of 4k text documents. They belong to 10 different classes. I am trying to understand how a random forest performs the classification. The problem is that my feature extraction class produces 200,000 features (a combination of words, bigrams, collocations, etc.). The data is very sparse, and the random forest implementation in sklearn does not work with sparse inputs.

Q. What are my options? Reduce the number of features? How? Q. Is there any random forest implementation that works with sparse arrays?

My relevant code is as follows:

 import logging
 import numpy as np
 from optparse import OptionParser
 import sys
 from time import time
 #import pylab as pl
 from sklearn.datasets import load_files
 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.ensemble import RandomForestClassifier
 from special_analyzer import *

 data_train = load_files(RAW_DATA_SRC_TR)
 data_test = load_files(RAW_DATA_SRC_TS)

 # split a training set and a test set
 y_train, y_test = data_train.target, data_test.target

 # SpecialAnalyzer is my class extracting features from text
 vectorizer = CountVectorizer(analyzer=SpecialAnalyzer())
 X_train = vectorizer.fit_transform(data_train.data)

 rf = RandomForestClassifier(max_depth=10, max_features=10)
 rf.fit(X_train, y_train)
2 answers

A few options: use only the 10,000 most frequent features by passing max_features=10000 to CountVectorizer, and convert the result to a dense numpy array with the toarray method:

 X_train_array = X_train.toarray() 
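
For instance, a minimal sketch of this approach (SpecialAnalyzer and data_train are the objects from the question; the 10,000 cut-off and n_estimators value are only illustrative):

 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.ensemble import RandomForestClassifier

 # keep only the 10,000 most frequent features so the dense matrix stays manageable
 vectorizer = CountVectorizer(analyzer=SpecialAnalyzer(), max_features=10000)
 X_train = vectorizer.fit_transform(data_train.data)

 # the forest is trained on the densified matrix, hence the toarray() call
 rf = RandomForestClassifier(n_estimators=100)
 rf.fit(X_train.toarray(), data_train.target)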

Alternatively, reduce the dimensionality to 100 or 300 components with:

 from sklearn.decomposition import TruncatedSVD

 pca = TruncatedSVD(n_components=300)
 X_reduced_train = pca.fit_transform(X_train)

However, in my experience, I could never get a random forest to perform better than a well-tuned linear model (for example, logistic regression with a grid-searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
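A minimal sketch of such a baseline, assuming the same data_train object from the question (the C grid and cv value are only placeholders):

 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.linear_model import LogisticRegression
 from sklearn.model_selection import GridSearchCV

 # TF-IDF features stay sparse; logistic regression handles sparse input directly
 tfidf = TfidfVectorizer()
 X_train = tfidf.fit_transform(data_train.data)

 # grid-search the inverse regularization strength C
 grid = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                     cv=5)
 grid.fit(X_train, data_train.target)
 print(grid.best_params_, grid.best_score_)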


Option 1: "If the number of variables is very large, forests can be run once with all the variables, and then run again using only the most important variables from the first run."

from: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp

I'm not sure whether the random forest in sklearn exposes a feature importance measure. The random forest in R implements mean decrease in Gini impurity, as well as mean decrease in accuracy.
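
For reference, sklearn's RandomForestClassifier does expose a feature_importances_ attribute (Gini-based), so the two-pass idea could be sketched roughly like this (X_train_array / y_train refer to the dense matrix and labels above; the 1,000 cut-off is arbitrary):

 import numpy as np
 from sklearn.ensemble import RandomForestClassifier

 # first pass: fit on all features
 rf = RandomForestClassifier(n_estimators=100)
 rf.fit(X_train_array, y_train)

 # keep the indices of, say, the 1,000 most important features
 top = np.argsort(rf.feature_importances_)[-1000:]

 # second pass: refit using only those features
 rf2 = RandomForestClassifier(n_estimators=100)
 rf2.fit(X_train_array[:, top], y_train)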

Option 2: Reduce the dimensionality. Use PCA or another dimensionality reduction technique to transform the matrix of N dimensions into a smaller matrix, and then use this smaller, denser matrix for the classification problem. A sketch of how this could be wired together is shown below.
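
For example, a sketch combining TruncatedSVD (which accepts the sparse matrix directly, so no toarray() is needed) with the random forest in a pipeline; the parameter values are only illustrative:

 from sklearn.pipeline import Pipeline
 from sklearn.decomposition import TruncatedSVD
 from sklearn.ensemble import RandomForestClassifier

 clf = Pipeline([
     ('svd', TruncatedSVD(n_components=300)),
     ('rf', RandomForestClassifier(n_estimators=100)),
 ])
 clf.fit(X_train, y_train)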

Option 3: Drop correlated features. I believe a random forest should be more robust to correlated features than multinomial logistic regression. That said, you may still have a number of correlated features. If you have many pairwise correlated variables, you can drop one of the two, and in theory you will not lose "predictive power". Beyond pairwise correlation, there is also multicollinearity. Check out: http://en.wikipedia.org/wiki/Variance_inflation_factor A sketch of the pairwise pruning follows.
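
A rough sketch of dropping one variable from each highly correlated pair (computing a full 200k x 200k correlation matrix is impractical, so this only makes sense after the feature count has already been reduced; X_dense and the 0.95 threshold are placeholders):

 import numpy as np

 # X_dense: (n_samples, n_features) dense array of the already-reduced feature set
 corr = np.corrcoef(X_dense, rowvar=False)

 # mark the second member of every pair whose |correlation| exceeds 0.95
 upper = np.triu(np.abs(corr), k=1)
 to_drop = np.unique(np.where(upper > 0.95)[1])

 X_pruned = np.delete(X_dense, to_drop, axis=1)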

