As larsmans said, adding more features increases the risk of overfitting, so test accuracy drops. The scikit-learn development branch now has a min_df parameter to drop any feature with fewer than that number of occurrences. Hence min_df=2 to min_df=5 could help you get rid of spurious bigrams.
Alternatively, you can use L1- or L1+L2-penalized linear regression (or classification) models, using the following classes:
- sklearn.linear_model.Lasso (regression)
- sklearn.linear_model.ElasticNet (regression)
- sklearn.linear_model.SGDRegressor (regression) with penalty == 'elasticnet' or 'l1'
- sklearn.linear_model.SGDClassifier (classification) with penalty == 'elasticnet' or 'l1'
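A small sketch of the sparsity effect, using Lasso from the list above on synthetic regression data (the data and coefficients here are made up for illustration): the L1 penalty drives the weights of uninformative features to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
# 100 samples, 20 features; only the first 3 carry signal, the rest are noise.
X = rng.randn(100, 20)
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)

# The L1 penalty zeroes out the noisy features' weights entirely,
# while the truly informative weights stay (slightly shrunk) nonzero.
print(lasso.coef_)
```

The same effect holds for the classification variants; with SGDClassifier the penalty strength is controlled by alpha rather than by a C-style parameter.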
This will ignore the spurious features and yield a sparse model with many zero weights for the noisy features. Grid searching the regularization parameters will be very important.
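One way to run that grid search, sketched with GridSearchCV over the alpha parameter of an L1-penalized SGDClassifier on synthetic data (the grid values are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
# 200 samples, 30 features; only the first 3 determine the label.
X = rng.randn(200, 30)
y = (X[:, :3].sum(axis=1) > 0).astype(int)

# Search the regularization strength on a log-spaced grid: too small keeps
# the noise, too large underfits by zeroing informative weights as well.
grid = GridSearchCV(
    SGDClassifier(penalty="l1", random_state=0, max_iter=2000),
    param_grid={"alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

In practice a log-spaced grid spanning several orders of magnitude is the usual starting point, refined around the best value found.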
You could also try univariate feature selection, as in the text classification example of scikit-learn (check the SelectKBest and chi2 utilities).
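A minimal sketch of that route, assuming a tiny labeled toy corpus: SelectKBest with the chi2 score keeps only the k count features most dependent on the class label.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "good movie great acting",    # positive
    "great film good story",      # positive
    "bad movie terrible acting",  # negative
    "terrible film bad plot",     # negative
]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(docs)
# chi2 requires non-negative features, which raw term counts satisfy;
# keep the 4 features whose counts are most dependent on the label.
selector = SelectKBest(chi2, k=4)
X_new = selector.fit_transform(X, labels)
print(X_new.shape)
```

Unlike the L1 route, this filters features before the model sees them, so it combines well with any downstream classifier.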