Are unigrams & bigrams (tf-idf) less accurate than just unigrams (tf-idf)?

This is a linear regression question with n-grams using tf-idf (term frequency–inverse document frequency). For this, I use scipy sparse matrices and sklearn for the linear regression.

I have 53 cases and more than 6,000 features when using unigrams. Predictions are evaluated with leave-one-out cross-validation (LeaveOneOut).

When I build the sparse tf-idf matrix from unigrams only, I get slightly better predictions than when I build it from unigrams + bigrams. The more columns I add to the matrix (columns for trigrams, quadgrams, quintgrams, etc.), the less accurate the regression prediction becomes.
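For reference, here is a minimal sketch of the setup described above using the current scikit-learn API; texts and y are placeholders for the 53 documents and the target variable, which are not shown in the question:

    # Compare unigram-only vs unigram+bigram tf-idf features with
    # leave-one-out CV and plain linear regression.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def loo_mse(ngram_range, texts, y):
        # Build a sparse tf-idf matrix for the given n-gram range and
        # evaluate it with leave-one-out cross-validation.
        X = TfidfVectorizer(ngram_range=ngram_range).fit_transform(texts)
        scores = cross_val_score(LinearRegression(), X, y,
                                 cv=LeaveOneOut(),
                                 scoring="neg_mean_squared_error")
        return -scores.mean()

    # print(loo_mse((1, 1), texts, y))   # unigrams only
    # print(loo_mse((1, 2), texts, y))   # unigrams + bigrams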

Is this common? How is this possible? I would have thought that the more features, the better.

2 answers

It's not very common for bigrams to perform worse than unigrams, but there are situations where it can happen. In particular, adding extra features may lead to overfitting. Tf-idf is unlikely to alleviate this, since longer n-grams are rarer, which gives them higher idf values.
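To illustrate the point about idf, here is a small made-up example (toy corpus, not from the question) showing that bigrams, which occur in fewer documents, receive higher idf weights than common unigrams:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["the cat sat", "the cat ran", "the dog sat"]
    vec = TfidfVectorizer(ngram_range=(1, 2))
    vec.fit(corpus)
    for term, idx in sorted(vec.vocabulary_.items()):
        print(f"{term!r}: idf = {vec.idf_[idx]:.2f}")
    # Bigrams such as 'cat ran' appear in fewer documents than unigrams
    # such as 'the', so their idf (and hence tf-idf weight) is higher.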

I'm not sure which variable you are trying to predict, and I have never done regression on text, but here are some comparable results from the literature for you to consider:

  • When generating random text from small (but non-trivial) training sets, 7-grams tend to reconstruct the input text almost verbatim, i.e. they overfit completely, while trigrams are more likely to generate "new" but still somewhat grammatical/recognizable text (see Jurafsky and Martin; I don't remember which chapter and I don't have my copy handy).
  • In NLP classification-style tasks done with kernels, quadratic kernels tend to perform better than cubic ones because the latter often overfit the training set. Note that unigram + bigram features can be thought of as a subset of a quadratic kernel's feature space, and {1,2,3}-grams of a cubic kernel's.

What happens depends on your training set; it may simply be too small.


As larsmans said, adding more variables/features makes it easier for the model to overfit, so test accuracy drops. The scikit-learn master branch now has a min_df parameter to cut off any feature with fewer than that number of occurrences. Hence min_df==2 to min_df==5 might help you get rid of spurious bigrams.
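A minimal sketch of that suggestion (texts stands in for the asker's corpus):

    # Drop n-grams that occur in fewer than min_df documents before
    # fitting the regression; try values between 2 and 5.
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    # X = vectorizer.fit_transform(texts)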

Alternatively, you can use L1 or L1 + L2 penalized linear regression (or classification) with the following classes:

  • sklearn.linear_model.Lasso (regression)
  • sklearn.linear_model.ElasticNet (regression)
  • sklearn.linear_model.SGDRegressor (regression) with penalty == 'elasticnet' or 'l1'
  • sklearn.linear_model.SGDClassifier (classification) with penalty == 'elasticnet' or 'l1'

This will tend to ignore spurious features and yield a sparse model with many zero weights for the noisy features. Grid-searching the regularization parameters will be very important, though.
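For example, here is a hedged sketch of a Lasso regression combined with a grid search over the regularization strength alpha; X and y are assumed to be the tf-idf matrix and target values from the question, and the alpha grid is only an illustrative starting point:

    # L1-penalized regression with leave-one-out CV to pick alpha.
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import GridSearchCV, LeaveOneOut

    param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}
    search = GridSearchCV(Lasso(max_iter=10000), param_grid,
                          cv=LeaveOneOut(),
                          scoring="neg_mean_squared_error")
    # search.fit(X, y)
    # print(search.best_params_, -search.best_score_)
    # nonzero = (search.best_estimator_.coef_ != 0).sum()  # surviving features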

You can also try univariate feature selection, as done in the text classification example of scikit-learn (check the SelectKBest and chi2 utilities).
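A sketch of such a pipeline; note that chi2 expects a classification target, so for the regression setting in the question f_regression is the analogous score function (X and y assumed as before, k chosen arbitrarily for illustration):

    # Keep only the k features most correlated with the target, then regress.
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    model = make_pipeline(SelectKBest(score_func=f_regression, k=500),
                          LinearRegression())
    # model.fit(X, y)   # X: sparse tf-idf matrix, y: target values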


Source: https://habr.com/ru/post/1432075/

