Feature Selection with LinearSVC

When I run the following code (from this example) on my data:

X_new = LinearSVC(C=0.01, penalty="l1", dual=False).fit_transform(X, y) 

I get:

 "Invalid threshold: all features are discarded" 

I tried specifying my own threshold:

    clf = LinearSVC(C=0.01, penalty="l1", dual=False)
    clf.fit(X, y)
    X_new = clf.transform(X, threshold=my_threshold)

but I either get:

  • An X_new array of the same size as X, when my_threshold is one of:

    • 'mean'
    • 'median'
  • Or an "Invalid threshold" error (for example, when passing a scalar value as the threshold)

I cannot publish the whole matrix X, but a few statistics are given below:

    > X.shape
    Out: (29,312)

    > np.mean(X, axis=1)
    Out: array([-0.30517191, -0.1147345 ,  0.03674294, -0.15926932,
                -0.05034101, -0.06357734, -0.08781186, -0.12865185,
                 0.14172452,  0.33640029,  0.06778798, -0.00217696,
                 0.09097335, -0.17915627,  0.03701893, -0.1361117 ,
                 0.13132006,  0.14406628, -0.05081956,  0.20777349,
                -0.06028931,  0.03541849, -0.07100492,  0.05740661,
                -0.38585413,  0.31837905,  0.14076042,  0.1182338 ,
                -0.06903557])

    > np.std(X, axis=1)
    Out: array([ 1.3267662 ,  0.75313658,  0.81796146,  0.79814621,
                 0.59175161,  0.73149726,  0.8087903 ,  0.59901198,
                 1.13414141,  1.02433752,  0.99884428,  1.11139231,
                 0.89254901,  1.92760784,  0.57181158,  1.01322265,
                 0.66705546,  0.70248779,  1.17107696,  0.88254386,
                 1.06930436,  0.91769016,  0.92915593,  0.84569395,
                 1.59371779,  0.71257806,  0.94307434,  0.95083782,
                 0.88996455])

    y = array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2, 2, 2,
               0, 0, 0, 0, 0, 0])

That's all with scikit-learn 0.14.

1 answer

First, check whether your SVM model is well trained before using it as the basis for a transformation. You may be using too small a C parameter, which causes sklearn to train a trivial model that discards all features. You can verify this by running classification tests on your data, or at least by printing the coefficients found (clf.coef_).
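A minimal sketch of that coefficient check. It assumes a recent scikit-learn and NumPy, and uses synthetic data with the same shape and labels as in the question, since the real X is not available. The helper name `count_selected` and the injected signal are made up for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-in for the question's data: 29 samples, 312 features,
# same label layout. The real X is not available, so this is an assumption.
rng = np.random.default_rng(0)
y = np.array([1] * 15 + [2] * 8 + [0] * 6)
X = rng.standard_normal((29, 312))
X[:, :5] += 0.5 * y[:, None]  # weak, made-up signal in the first 5 features


def count_selected(C):
    """Fit an L1-penalized LinearSVC and count features with a nonzero weight."""
    clf = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    # coef_ has shape (n_classes, n_features); a feature survives if any
    # one-vs-rest classifier gives it a nonzero coefficient.
    return int(np.sum(np.any(clf.coef_ != 0, axis=0)))


nonzero_by_C = {C: count_selected(C) for C in (0.01, 0.1, 1.0, 10.0)}
for C, n_kept in nonzero_by_C.items():
    print(f"C={C}: {n_kept} features with a nonzero coefficient")
```

If the count is zero for the C you are using, every coefficient has been shrunk to zero and there is nothing left for the transform to keep, which matches the "all features are discarded" error.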

It would be a good idea to run a grid search to find a C that generalizes better, and then use that model for the transformation.
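A rough sketch of that grid search followed by the feature selection. It assumes a recent scikit-learn, where `GridSearchCV` lives in `sklearn.model_selection` and `SelectFromModel` takes over the role of `clf.transform`. In 0.14 the import would be `sklearn.grid_search`, and you would call `transform` on the fitted classifier directly. The data, the C grid, and the fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic stand-in with the question's shape and labels (assumption).
rng = np.random.default_rng(0)
y = np.array([1] * 15 + [2] * 8 + [0] * 6)
X = rng.standard_normal((29, 312))
for k in range(3):
    # Made-up signal: class k is shifted on its own block of 3 features.
    X[y == k, 3 * k:3 * k + 3] += 3.0

# Candidate C values; the grid itself is an arbitrary illustrative choice.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    LinearSVC(penalty="l1", dual=False, max_iter=10000),
    param_grid=param_grid,
    cv=cv,
)
search.fit(X, y)
best_C = search.best_params_["C"]
print("best C:", best_C, "mean CV accuracy:", search.best_score_)

# Keep only the features the refitted best model gives a non-negligible weight.
selector = SelectFromModel(search.best_estimator_, prefit=True)
X_new = selector.transform(X)
print("reduced shape:", X_new.shape)
```

With only 29 samples the cross-validation scores will be noisy, so treat the chosen C as a starting point and not as a definitive value.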


Source: https://habr.com/ru/post/1496841/

