Out-of-core training of Scikit's LinearSVC Classifier

How do you train Scikit's LinearSVC on a dataset that is too large or impractical to fit into memory? I'm trying to use it to classify documents, and I have several thousand labeled example records, but when I try to load all this text into memory and train LinearSVC, it consumes over 65% of my memory, and I have to kill it before my system stops responding entirely.

Is it possible to format my training data as a single file and pass LinearSVC the file name, instead of having to call the fit() method?

I did find this guide, but it only covers classification and assumes training is done incrementally, something LinearSVC doesn't support.

2 answers

As far as I know, non-incremental implementations like LinearSVC need the whole dataset in memory for training. Unless you write an incremental version of it yourself, you won't be able to use LinearSVC here.

There are classifiers in scikit-learn that can be trained incrementally, as in the guide you found, which uses SGDClassifier. SGDClassifier has a partial_fit method that allows you to train it in batches. Other classifiers that support incremental learning include MultinomialNB (Multinomial Naive Bayes) and BernoulliNB (Bernoulli Naive Bayes).
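A minimal sketch of that batch approach, assuming the training data sits in a two-column CSV of (text, label) rows; the file name, label set, batch size and n_features value are all illustrative:

    import csv
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # HashingVectorizer is stateless, so it never needs a full pass over
    # the data to build a vocabulary, which makes it a good fit for
    # out-of-core learning.
    vectorizer = HashingVectorizer(n_features=2**18)
    clf = SGDClassifier(loss='hinge')  # hinge loss gives a linear SVM

    # partial_fit needs the full set of class labels on the first call.
    all_classes = ['spam', 'ham']  # illustrative label set

    def iter_batches(filename, batch_size=1000):
        # Yield (texts, labels) batches from a CSV of (text, label) rows.
        with open(filename, 'r') as f:
            texts, labels = [], []
            for text, label in csv.reader(f):
                texts.append(text)
                labels.append(label)
                if len(texts) == batch_size:
                    yield texts, labels
                    texts, labels = [], []
            if texts:
                yield texts, labels

    for texts, labels in iter_batches('training_data.csv'):
        X = vectorizer.transform(texts)  # sparse features for this batch only
        clf.partial_fit(X, labels, classes=all_classes)

Only one batch of raw text is held in memory at a time, so peak memory stays flat no matter how large the file grows.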


You can use a generator function like this:

    import csv

    def lineGenerator():
        # Stream the first column of the CSV one row at a time instead of
        # loading the whole file into memory.
        with open(INPUT_FILENAMES_TITLE[0], 'r') as f1:
            title_reader = csv.reader(f1)
            for line in title_reader:
                yield line[0]

Then you can call the classifier like this:

    clf = LinearSVC()
    clf.fit(lineGenerator())

This assumes that INPUT_FILENAMES_TITLE[0] is your file name.
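One caveat: LinearSVC's fit() expects a numeric feature matrix and a separate label vector, not a raw generator of strings, so the streamed lines still have to pass through a vectorizer before the call. A sketch of one way to wire that up, keeping this answer's INPUT_FILENAMES_TITLE[0] file and assuming the label sits in the second CSV column:

    import csv
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.svm import LinearSVC

    def columnGenerator(filename, column):
        # Stream a single CSV column without loading the whole file.
        with open(filename, 'r') as f:
            for row in csv.reader(f):
                yield row[column]

    # Two streaming passes over the file: one hashes the text into a compact
    # sparse matrix, the other collects the labels, so the raw text is never
    # held in memory all at once.
    X = HashingVectorizer(n_features=2**18).transform(
        columnGenerator(INPUT_FILENAMES_TITLE[0], 0))
    y = list(columnGenerator(INPUT_FILENAMES_TITLE[0], 1))

    clf = LinearSVC()
    clf.fit(X, y)

If even the hashed sparse matrix is too large to fit, the incremental SGDClassifier approach from the answer above is the better option.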


Source: https://habr.com/ru/post/1499666/

