TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] when using the RF classifier?

I am studying random forests in scikit-learn, and as an example I would like to use a random forest classifier to classify text from my own dataset. So first I vectorized the text with TF-IDF, and then for the classification:

    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(X_train, y_train)
    prediction = classifier.predict(X_test)
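The TF-IDF step itself is not shown here; it is roughly something like the sketch below (names such as train_texts and test_texts are placeholders, and the exact call appears in the fuller attempt further down). The key point is that fit_transform returns a scipy sparse matrix, which is what the errors below are about:

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf_vect = TfidfVectorizer()
    X_train = tfidf_vect.fit_transform(train_texts)  # scipy.sparse matrix, not a dense array
    X_test = tfidf_vect.transform(test_texts)        # same vocabulary, also sparse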

When I ran the classification, I got the following:

 TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array. 

Then I used .toarray() on X_train and got the following:

 TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] 
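For context, what was run at that point was presumably something like the following (a reconstruction, not the exact code): only the training matrix was densified, so predict() still received a sparse matrix.

    from sklearn.ensemble import RandomForestClassifier

    # X_train, X_test, y_train come from the TF-IDF step above
    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(X_train.toarray(), y_train)   # dense training data: fit() works now
    prediction = classifier.predict(X_test)      # X_test is still sparse, so len(X) fails here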

From the previous question, as I understand it, I need to reduce the dimensionality of the array, so I did the same:

    from sklearn.decomposition.truncated_svd import TruncatedSVD

    pca = TruncatedSVD(n_components=300)
    X_reduced_train = pca.fit_transform(X_train)

    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(X_reduced_train, y_train)
    prediction = classifier.predict(X_testing)

Then I got this exception:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict n_samples = len(X) File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__ raise TypeError("sparse matrix length is ambiguous; use getnnz()" TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] 

I tried the following:

 prediction = classifier.predict(X_train.getnnz()) 

And got the following:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict n_samples = len(X) TypeError: object of type 'int' has no len() 

This raised two questions: how can I use random forests for classification properly, and what is happening to X_train?

Then I tried the following:

    df = pd.read_csv('/path/file.csv', header=0, sep=',', names=['id', 'text', 'label'])
    X = tfidf_vect.fit_transform(df['text'].values)
    y = df['label'].values

    from sklearn.decomposition.truncated_svd import TruncatedSVD

    pca = TruncatedSVD(n_components=2)
    X = pca.fit_transform(X)

    a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(a_train, b_train)
    prediction = classifier.predict(a_test)

    from sklearn.metrics.metrics import precision_score, recall_score, confusion_matrix, classification_report

    print '\nscore:', classifier.score(a_train, b_test)
    print '\nprecision:', precision_score(b_test, prediction)
    print '\nrecall:', recall_score(b_test, prediction)
    print '\n confussion matrix:\n', confusion_matrix(b_test, prediction)
    print '\n clasification report:\n', classification_report(b_test, prediction)
2 answers

It's a little unclear whether you are passing the same data structure (type and shape) to the classifier's fit and predict methods. Random forests take a long time with a very large number of features, hence the suggestion to reduce the dimensionality in the post you are referring to.

You should apply the SVD to both the training and the test data so that the classifier is trained on input with the same format as the data you want to predict on. Check that the input to fit and the input to predict have the same number of features, and that both are dense arrays rather than sparse matrices.
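A minimal sketch of that pattern, assuming variable names like X_train, X_test and y_train from the question (not the exact code):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.ensemble import RandomForestClassifier

    svd = TruncatedSVD(n_components=300)

    # Fit the SVD on the training matrix, then apply the *same* fitted transformer
    # to the test matrix so both inputs have the same number of features.
    X_train_reduced = svd.fit_transform(X_train)   # dense ndarray, shape (n_train, 300)
    X_test_reduced = svd.transform(X_test)         # dense ndarray, shape (n_test, 300)

    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(X_train_reduced, y_train)
    prediction = classifier.predict(X_test_reduced)

TruncatedSVD already returns a dense numpy array, so no .toarray() call is needed after the reduction.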

Updated with an example, using a DataFrame:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cross_validation import train_test_split

    tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)

    df = pd.DataFrame({'text': ['cat on the', 'angel eyes has', 'blue red angel', 'one two blue',
                                'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                       'class': [0, 0, 0, 1, 1, 1, 0, 3]})

    X = tfidf_vect.fit_transform(df['text'].values)   # sparse TF-IDF matrix
    y = df['class'].values

    from sklearn.decomposition.truncated_svd import TruncatedSVD

    pca = TruncatedSVD(n_components=2)
    X_reduced_train = pca.fit_transform(X)

    a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(a_train.toarray(), b_train)         # densify before fitting
    prediction = classifier.predict(a_test.toarray())  # densify before predicting

Note that the SVD is applied before the split into training and test sets, so the array passed to predict has the same number of features as the array on which fit was called.


I don't know much about sklearn, although I vaguely recall an earlier issue that was caused by the switch to sparse matrices. Internally, some of the matrices had to be converted with m.toarray() or m.todense().

But to give you an idea of what the error message means, consider:

    In [907]: A=np.array([[0,1],[3,4]])
    In [908]: M=sparse.coo_matrix(A)
    In [909]: len(A)
    Out[909]: 2
    In [910]: len(M)
    ...
    TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
    In [911]: A.shape[0]
    Out[911]: 2
    In [912]: M.shape[0]
    Out[912]: 2

len() is commonly used in Python to count the number of elements at the first level of a list. When applied to a 2d array it gives the number of rows, but A.shape[0] is the better way to count rows, and M.shape[0] is the same. In this case you are not interested in .getnnz(), which is the number of nonzero terms of the sparse matrix. A does not have that method, although the same count can be obtained from A.nonzero().
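To illustrate that last point, continuing the same session (output values follow from the small example matrix above):

    In [913]: M.getnnz()           # stored non-zero entries of the sparse matrix
    Out[913]: 3
    In [914]: len(A.nonzero()[0])  # equivalent count for the dense array
    Out[914]: 3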

