What training algorithms should I consider for a log-linear regression model?

I need to train a regression model on a large set of training examples, with the ability to incorporate arbitrary features. What training algorithms should I consider, and why?

Short description of the problem:

  • Approximately 5 million training examples
  • New examples arriving at a rate of 2-4 million per year
  • The examples currently contain about 10 features each.
  • Approximately 400 thousand features populated so far (out of a much larger total space of possibilities).
  • Additional features will be added over time
  • Retraining or adapting the model (at least) daily to incorporate new examples
  • Optimization criterion: minimize squared percentage error (see the note just after this list)
  • Output: a single real number
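
(For concreteness, by squared percentage error I mean the usual relative squared error; the notation below is mine, not something defined elsewhere in this post:)

    \min_w \sum_{i=1}^{N} \left( \frac{f_w(x_i) - y_i}{y_i} \right)^2

where f_w(x_i) is the model's prediction for example i and y_i is the true value.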

I have some experience training log-linear models on classification problems of similar size (using SVMs, averaged and voted perceptrons, etc.). The ability to add arbitrary features is important, but in this case training time is also at a premium.

For example, my only experiment with SVMlight so far took several weeks to converge on a subset of this data. We could parallelize across a multicore machine or (possibly) a cluster, but we need models that train in a matter of minutes. Online training would be even better.

I have successfully trained averaged perceptron models (and quickly). However, as far as I know, the averaged perceptron is not usually applied to regression. Does the averaged perceptron offer any convergence guarantees for regression? Is there another formal reason it should not be applied? Or is it a reasonable fit for my requirements?
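
For reference, what I have in mind is roughly the sketch below: plain online SGD on squared loss with weight averaging, i.e. the obvious regression analogue of an averaged perceptron. This is just a hypothetical illustration (class and parameter names are mine), not a claim that this is "the" averaged perceptron for regression.

```java
// Hypothetical sketch: online least-squares learner with weight averaging.
// Sparse examples are maps from feature id to value.
import java.util.HashMap;
import java.util.Map;

public class AveragedSgdRegressor {
    private final Map<Integer, Double> w = new HashMap<>();     // current weights
    private final Map<Integer, Double> wSum = new HashMap<>();  // running sum for averaging
    private long updates = 0;
    private final double learningRate;

    public AveragedSgdRegressor(double learningRate) {
        this.learningRate = learningRate;
    }

    public double predict(Map<Integer, Double> x) {
        double s = 0.0;
        for (Map.Entry<Integer, Double> e : x.entrySet()) {
            s += w.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return s;
    }

    // One online update; the gradient of squared loss is (prediction - y) * x.
    public void update(Map<Integer, Double> x, double y) {
        double err = predict(x) - y;
        for (Map.Entry<Integer, Double> e : x.entrySet()) {
            w.merge(e.getKey(), -learningRate * err * e.getValue(), Double::sum);
        }
        // naive averaging (O(|w|) per update); a real implementation would
        // use the standard lazy-averaging trick instead
        for (Map.Entry<Integer, Double> e : w.entrySet()) {
            wSum.merge(e.getKey(), e.getValue(), Double::sum);
        }
        updates++;
    }

    // Averaged weight, to be used at prediction time once training is done.
    public double averagedWeight(int featureId) {
        return updates == 0 ? 0.0 : wSum.getOrDefault(featureId, 0.0) / updates;
    }
}
```

My real question is whether anything like this inherits the guarantees the averaged perceptron has for classification.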

What other options should I explore? An SVM would probably offer excellent accuracy, but quadratic training time is unacceptable. If efficient linear-time SVM algorithms are available, that could work well.

Potential pluses:

  • Online learning
  • An available open-source implementation (ideally in Java). We can roll our own implementation if necessary, but I would rather avoid that if possible.

Thanks for your input.

+6
1 answer

This is a classic problem of large-scale SVM training. An SVM model will need to be retrained whenever new features are added, and also whenever new data arrives, unless you are using an online SVM. Some options:

Practical options (off the shelf):

LIBLINEAR. If you can use a linear SVM, there are algorithms that exploit the linear kernel to give better-than-quadratic training time. Check out LIBLINEAR, which is from the same research group as LIBSVM. They just added regression in version 1.91, released yesterday: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
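
If it helps, here is a minimal sketch of what this might look like from Java via the third-party liblinear-java port (de.bwaldvogel's wrapper of LIBLINEAR). The class and method names are from that port as I remember them, and the tiny dataset and parameter values are made up, so treat this as a sketch rather than documentation; in the command-line tool, the new regression solvers are, if I recall correctly, selected with -s 11/12/13.

```java
// Hedged sketch: L2-loss support vector regression with the liblinear-java port.
import de.bwaldvogel.liblinear.*;

public class LiblinearSvrSketch {
    public static void main(String[] args) {
        Problem problem = new Problem();
        problem.l = 3;                 // number of training examples
        problem.n = 4;                 // number of features
        problem.bias = -1;             // no bias term
        problem.x = new Feature[][] {  // sparse feature vectors (index:value)
            { new FeatureNode(1, 0.5), new FeatureNode(3, 1.0) },
            { new FeatureNode(2, 2.0) },
            { new FeatureNode(1, 1.5), new FeatureNode(4, 0.25) },
        };
        problem.y = new double[] { 3.2, 1.1, 2.7 };   // real-valued targets

        // L2-regularized L2-loss SVR, C = 1.0, stopping tolerance 0.01
        Parameter param = new Parameter(SolverType.L2R_L2LOSS_SVR, 1.0, 0.01);
        Model model = Linear.train(problem, param);

        double prediction = Linear.predict(model, problem.x[0]);
        System.out.println("prediction = " + prediction);
    }
}
```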

Oracle ODM. Oracle has SVM available in its ODM package. They take a pragmatic approach: basically providing a "reasonably good" SVM without paying the computational cost of finding a truly optimal solution. They use some sampling and model-selection techniques; you can find information about this here: http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/overview/support-vector-machines-paper-1205-129825.pdf

SHOGUN. The SHOGUN machine learning toolbox is designed for large-scale learning; it interfaces with a number of SVM implementations as well as other methods. I have never used it, but it may be worth a look: http://www.shogun-toolbox.org

Kernel-machines.org has a list of software packages: http://www.kernel-machines.org/software

Other SVM research

If you want to roll your own, there are many methods for scaling SVMs to large datasets that have been published in research papers, but the code is not necessarily as available, usable, or supported as the packages above. They report good results, but each has its own drawbacks. Many involve some level of data selection. For example, several research papers use linear-time clustering algorithms to cluster the data and then train successive SVM models on the clusters, in order to build a model without using all of the data. Core Vector Machines claim linear training time, but there is some criticism over whether their accuracy is as high as claimed. Numerous papers use various heuristics to pre-select the most likely support-vector candidates. Most of this work targets classification, but it could probably be adapted to regression. If you want more details on any of this research, I can add some links.

Tools for exploring algorithms

You probably already know about these, but I figured I would mention them here just in case:

There are other algorithms that have good runtimes on large datasets, but whether they will work well is hard to say; it depends on the make-up of your data. Since runtime matters, I would start with the simpler models and work up to the more complex ones. ANNs, decision-tree regression, Bayesian methods, locally weighted linear regression, or a hybrid approach such as model trees (a decision tree whose leaf nodes are linear models) can often be trained much more quickly than an SVM on a large dataset and may produce perfectly good results.

WEKA. Weka is a good tool for exploring your options. I would use WEKA to try out subsets of your data with different algorithms. Its source code is open and in Java as well, in case you pick something and want to tailor it to your needs. http://www.cs.waikato.ac.nz/ml/weka/
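
For example, here is a minimal sketch of trying the model-tree idea mentioned above through WEKA's Java API; the file name cases.arff and the parameter choices are placeholders of mine.

```java
// Sketch: train and cross-validate an M5P model tree on a (small) data subset.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaModelTreeSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("cases.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // target = last attribute

        M5P tree = new M5P();          // model tree: linear models at the leaves
        tree.buildClassifier(data);

        // 10-fold cross-validation on a fresh copy, to compare regressors
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new M5P(), data, 10, new Random(1));
        System.out.println("RMSE = " + eval.rootMeanSquaredError());
    }
}
```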

R. The R programming language also implements many of these algorithms, and programming in it is similar to Matlab. http://www.r-project.org/

I would not recommend running WEKA or R against the full large-scale dataset, but they are useful tools for narrowing down which algorithms might work well for you.

+7
