Out-of-core training of Scikit's LinearSVC Classifier

How do you train Scikit's LinearSVC on a dataset that is too large or impractical to fit into memory? I'm trying to use it to classify documents, and I have several thousand labeled example records, but when I try to load all this text into memory and train LinearSVC, it consumes over 65% of my memory, and I have to kill it before my system stops responding entirely.

Is it possible to format my training data as a single file and pass LinearSVC the file name, instead of having to call the fit() method?

I did find this guide, but it only covers classification and assumes training is done incrementally, something LinearSVC doesn't support.

2 answers

As far as I know, non-incremental implementations like LinearSVC need the whole dataset in memory for training. Unless you write an incremental version of it yourself, you won't be able to use LinearSVC here.

There are classifiers in scikit-learn that can be trained incrementally, as in the guide you found, which uses SGDClassifier. SGDClassifier has a partial_fit method that allows you to train it in batches. Other classifiers that support incremental learning include MultinomialNB (Multinomial Naive Bayes) and BernoulliNB (Bernoulli Naive Bayes).
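A minimal sketch of that batch approach, assuming the training data sits in a two-column CSV of (text, label) rows; the file name, label set, batch size and n_features value are all illustrative:

    import csv
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # HashingVectorizer is stateless, so it never needs a full pass over
    # the data to build a vocabulary, which makes it a good fit for
    # out-of-core learning.
    vectorizer = HashingVectorizer(n_features=2**18)
    clf = SGDClassifier(loss='hinge')  # hinge loss gives a linear SVM

    # partial_fit needs the full set of class labels on the first call.
    all_classes = ['spam', 'ham']  # illustrative label set

    def iter_batches(filename, batch_size=1000):
        # Yield (texts, labels) batches from a CSV of (text, label) rows.
        with open(filename, 'r') as f:
            texts, labels = [], []
            for text, label in csv.reader(f):
                texts.append(text)
                labels.append(label)
                if len(texts) == batch_size:
                    yield texts, labels
                    texts, labels = [], []
            if texts:
                yield texts, labels

    for texts, labels in iter_batches('training_data.csv'):
        X = vectorizer.transform(texts)  # sparse features for this batch only
        clf.partial_fit(X, labels, classes=all_classes)

Only one batch of raw text is held in memory at a time, so peak memory stays flat no matter how large the file grows.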


You can use a generator function like this:

    import csv

    def lineGenerator():
        # Stream the first column of the CSV one row at a time instead of
        # loading the whole file into memory.
        with open(INPUT_FILENAMES_TITLE[0], 'r') as f1:
            title_reader = csv.reader(f1)
            for line in title_reader:
                yield line[0]

Then you can call the classifier like this:

    clf = LinearSVC()
    clf.fit(lineGenerator())

This assumes that INPUT_FILENAMES_TITLE[0] is your file name.
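One caveat: LinearSVC's fit() expects a numeric feature matrix and a separate label vector, not a raw generator of strings, so the streamed lines still have to pass through a vectorizer before the call. A sketch of one way to wire that up, keeping this answer's INPUT_FILENAMES_TITLE[0] file and assuming the label sits in the second CSV column:

    import csv
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.svm import LinearSVC

    def columnGenerator(filename, column):
        # Stream a single CSV column without loading the whole file.
        with open(filename, 'r') as f:
            for row in csv.reader(f):
                yield row[column]

    # Two streaming passes over the file: one hashes the text into a compact
    # sparse matrix, the other collects the labels, so the raw text is never
    # held in memory all at once.
    X = HashingVectorizer(n_features=2**18).transform(
        columnGenerator(INPUT_FILENAMES_TITLE[0], 0))
    y = list(columnGenerator(INPUT_FILENAMES_TITLE[0], 1))

    clf = LinearSVC()
    clf.fit(X, y)

If even the hashed sparse matrix is too large to fit, the incremental SGDClassifier approach from the answer above is the better option.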


Source: https://habr.com/ru/post/1499666/

