Multi-label classification for a large dataset

I am solving a multi-label classification problem. I have about 6 million rows to process, each a huge chunk of text, and each row is marked with several tags in a separate column.

Any advice on which scikit-learn tools could help me scale my code? I am using One-vs-Rest with an SVM inside it, but it does not scale beyond 90-100 thousand rows.

    classifier = Pipeline([
        ('vectorizer', CountVectorizer(min_df=1)),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC())),
    ])
2 answers

SVMs scale well as the number of columns (features) grows, but poorly with the number of rows, since they essentially learn which rows constitute the support vectors. This is a common complaint about SVMs, but most people do not understand why, because SVMs scale well enough for most reasonably sized datasets.

  • You will want one-vs-rest, which you are already using. One-vs-one will not scale well here (n(n-1)/2 classifiers vs. n).
  • Set the minimum document frequency (min_df) for terms to at least 5, and maybe even higher; this will drastically reduce the number of columns. You will find many words that occur only once or twice, and they add no value to your classification, since at that frequency the algorithm cannot generalize. Stemming (or lemmatization) can help there. See the first sketch after this list.
  • Also remove stop words (a, an, the, prepositions, and so on; search Google for a list). This will further reduce the number of columns.
  • Once you have reduced the number of columns as described, I would try to eliminate some rows. If there are documents that are very noisy, are very short after steps 1-3, or perhaps are very long, I would try to eliminate them. Look at the standard deviation and mean of the document length, and plot document length (in number of words) against frequency at that length to decide on cut-offs.
  • If the dataset is still too large, I would suggest a decision tree or naive Bayes, both of which are available in sklearn. Decision trees scale very well; I would set a depth threshold to limit the depth of the tree, because otherwise it will try to grow a huge tree to memorize the dataset. Naive Bayes, on the other hand, trains very fast and handles a large number of columns quite well. If the decision tree works well, you can try a random forest with a few trees and use IPython parallelization for multithreading.
  • Alternatively, segment your data into smaller datasets, train a classifier on each, persist them to disk, and then build an ensemble classifier from these classifiers (see the second sketch below).
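
A minimal sketch of the min_df, stop-word, document-length, and naive Bayes suggestions above. The train_texts / train_tags variables (raw documents plus a binary label-indicator matrix) are placeholders for your own data:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.multiclass import OneVsRestClassifier

    # Inspect the document length distribution before deciding which rows to drop.
    lengths = np.array([len(text.split()) for text in train_texts])
    print('mean length: %.1f, sd: %.1f' % (lengths.mean(), lengths.std()))

    classifier = Pipeline([
        # Drop terms seen in fewer than 5 documents plus English stop words,
        # which shrinks the number of columns dramatically.
        ('tfidf', TfidfVectorizer(min_df=5, stop_words='english')),
        # Naive Bayes trains fast and copes well with many columns.
        ('clf', OneVsRestClassifier(MultinomialNB())),
    ])
    classifier.fit(train_texts, train_tags)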
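
And a rough sketch of the last item: train one classifier per slice of the data, persist each with joblib, and combine their predictions with a per-label majority vote. The load_chunk and build_pipeline helpers and the chunk count are made up for illustration:

    import numpy as np
    from joblib import dump, load

    n_chunks = 60  # roughly 100k rows per chunk for 6M rows

    # Train and persist one classifier per data slice.
    for i in range(n_chunks):
        texts, tags = load_chunk(i)   # hypothetical helper returning one slice
        clf = build_pipeline()        # hypothetical factory building the pipeline above
        clf.fit(texts, tags)
        dump(clf, 'clf_%02d.joblib' % i)

    # Ensemble the saved models with a per-label majority vote.
    def predict_ensemble(texts):
        votes = sum(load('clf_%02d.joblib' % i).predict(texts) for i in range(n_chunks))
        return (votes > n_chunks / 2).astype(int)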

HashingVectorizer will work if you iteratively chunk your data into batches of, say, 10k or 100k documents that fit in memory.

You can then pass each batch of transformed documents to a linear classifier that supports the partial_fit method (for example SGDClassifier or PassiveAggressiveClassifier), and keep iterating over new batches.

You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, so that you can monitor the accuracy of the partially trained model without waiting to have seen all the samples.
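
A minimal out-of-core sketch along those lines, assuming a hypothetical iter_batches() generator that yields (texts, labels) chunks from disk and a pre-loaded validation set val_texts / y_val; for the multi-label case you would train one such binary classifier per tag:

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2 ** 18)  # stateless, nothing to fit
    clf = SGDClassifier(loss='hinge')                   # linear SVM trained with SGD

    all_classes = np.array([0, 1])  # must be passed on the first partial_fit call
    X_val = vectorizer.transform(val_texts)  # held-out validation set, e.g. 10k documents

    for texts, labels in iter_batches(batch_size=10000):  # hypothetical generator
        X = vectorizer.transform(texts)
        clf.partial_fit(X, labels, classes=all_classes)
        # Monitor accuracy on the validation set without waiting for all samples.
        print('validation accuracy: %.3f' % clf.score(X_val, y_val))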

You can also run this in parallel on several machines over partitions of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
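
A rough sketch of that averaging step, assuming models is a list of already-fitted SGDClassifier instances, one per machine/partition, all trained with the same HashingVectorizer settings:

    import numpy as np

    def average_linear_models(models):
        # Reuse the first model as a container for the averaged parameters.
        averaged = models[0]
        averaged.coef_ = np.mean([m.coef_ for m in models], axis=0)
        averaged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
        return averaged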

I discuss this in a talk I gave at PyData in March 2013: http://vimeo.com/63269736

This tutorial on parallel machine learning with scikit-learn and IPython.parallel has sample code: https://github.com/ogrisel/parallel_ml_tutorial

