I'm studying random forests in scikit-learn, and as an example I'd like to use a random forest classifier for text classification with my own dataset. I first vectorized the text with TF-IDF and then classified:
```python
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)
```
When I ran the classifier, I got the following error:
```
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
```
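For reference, this is a minimal sketch of the conversion the message asks for (the toy corpus is my own; `TfidfVectorizer` returns a scipy sparse matrix, and `.toarray()` makes it dense):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["spam spam spam", "ham and eggs", "spam and ham"]  # toy corpus (assumption)

X = TfidfVectorizer().fit_transform(docs)  # scipy.sparse CSR matrix
X_dense = X.toarray()                      # dense numpy array a forest can accept
```

Note that densifying a large TF-IDF matrix can use a lot of memory, since most entries are zeros.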
Then I applied .toarray() to X_train and got the following:
```
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
```
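As far as I can tell, the error means sparse matrices don't define `len()` at all; you have to ask for `shape[0]` (rows) or `getnnz()` (stored non-zeros) explicitly. A toy sketch of the difference:

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.eye(3))   # 3x3 sparse identity matrix
try:
    len(m)                  # raises: sparse matrix length is ambiguous
except TypeError:
    pass

n_rows = m.shape[0]         # number of rows (samples): 3
n_nonzero = m.getnnz()      # number of stored non-zero entries: 3
```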
From a previous question, as I understood it, I needed to reduce the dimensionality of the array, so I did the following:
```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

pca = TruncatedSVD(n_components=300)
X_reduced_train = pca.fit_transform(X_train)

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(X_reduced_train, y_train)
prediction = classifier.predict(X_test)
```
Then I got this exception:
```
File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__
    raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
```
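Reading the traceback, it seems predict was handed the raw sparse test matrix; presumably the test set also has to go through the same fitted SVD before prediction, which additionally makes it dense. A sketch with random toy matrices standing in for my TF-IDF data (all names and sizes are my own):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# toy sparse matrices standing in for the TF-IDF train/test data (assumption)
X_train_sp = sparse_random(20, 50, density=0.1, random_state=0)
X_test_sp = sparse_random(10, 50, density=0.1, random_state=1)

svd = TruncatedSVD(n_components=5, random_state=0)
X_reduced_train = svd.fit_transform(X_train_sp)  # fit on training data only
X_reduced_test = svd.transform(X_test_sp)        # reuse the fitted SVD; output is dense
```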
I tried the following:
```python
prediction = classifier.predict(X_train.getnnz())
```
And got the following:
```
File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
TypeError: object of type 'int' has no len()
```
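As far as I understand, predict expects a 2-D (n_samples, n_features) array of samples, not the integer that getnnz() returns. A toy sketch of the expected call shape (data made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # 2-D: (n_samples, n_features)
y = np.array([0, 0, 1, 1])

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
pred = clf.predict(np.array([[0.1], [2.9]]))  # predict takes the same 2-D shape
```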
This raised two questions: how can I use random forests for classification properly, and what is happening with X_train?
Then I tried the following:
```python
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report

df = pd.read_csv('/path/file.csv', header=0, sep=',', names=['id', 'text', 'label'])

tfidf_vect = TfidfVectorizer()
X = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values

pca = TruncatedSVD(n_components=2)
X = pca.fit_transform(X)

a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

print '\nscore:', classifier.score(a_test, b_test)
print '\nprecision:', precision_score(b_test, prediction)
print '\nrecall:', recall_score(b_test, prediction)
print '\nconfusion matrix:\n', confusion_matrix(b_test, prediction)
print '\nclassification report:\n', classification_report(b_test, prediction)
```
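For reference, it seems the same steps can also be chained with a Pipeline, so the vectorizer and SVD fitted on the training split are automatically reused on the test split. A sketch with toy texts (all data and names below are my own, using newer scikit-learn module paths):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# toy labeled texts standing in for the CSV data (assumption)
texts = ["good movie", "bad movie", "great film", "awful film"] * 10
labels = [1, 0, 1, 0] * 10

a_train, a_test, b_train, b_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42)

model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
model.fit(a_train, b_train)          # each step is fitted on the training split only
score = model.score(a_test, b_test)  # vectorizer and SVD are reused via transform
```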