Updating training data for supervised learning - how?

We have a classifier for web pages. The classifier model was built from training data about two years ago. We have noticed that the model's performance keeps deteriorating, and we assume this is because the properties of web pages change over time (mostly the words and terminology used, but also page structure, HTML tags, etc.).

How would you approach this problem? Do we just rebuild all the training data and train a new model? Is there a shortcut? Are there any common practices or documents on how to do this? Please note that we are quite attached to the supervised-learning workflow, where system administrators train the classifier, evaluate its performance on a test set, and then deploy the classifier to the "production" system.

Hope this is not too vague ...

+5
2 answers

There are a number of factors to take into account, the main ones being the state of the classifier and of the data.

If the changes to the web pages do not require any new inputs, you can retrain your existing classifier on the latest data.

If the classifier was not designed to be retrained on new data, it can be difficult to keep the old model. Likewise, if the inputs or outputs have changed, it may be easier to build a new classifier.

I don't know which classifier you are using, or what means you have for retraining it or processing your data, so I can't give a direct answer to your problem or say whether there are any shortcuts. It really comes down to how accessible your classifier is and what it costs to maintain.

As stated in your previous question, it is recommended to test the new classifier and compare it against the old one to make sure it meets your requirements before deploying it to the production environment.
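
A minimal sketch of that comparison, assuming a scikit-learn setup; old_clf, X_new and y_new are hypothetical names for the existing model and the freshly labelled pages:

    # Compare the old model and a retrained model on the same held-out test set
    # before deploying anything. old_clf, X_new, y_new are hypothetical names.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, f1_score

    X_train, X_test, y_train, y_test = train_test_split(
        X_new, y_new, test_size=0.2, random_state=0)

    new_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    for name, clf in [("old model", old_clf), ("retrained model", new_clf)]:
        pred = clf.predict(X_test)
        print(name,
              "accuracy:", accuracy_score(y_test, pred),
              "macro F1:", f1_score(y_test, pred, average="macro"))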

+2

If you use a standard off-the-shelf classifier, there is probably no way to update its parameters with new data (it depends on what you are using). Retraining from scratch is probably the fastest way forward. If you go down this route, consider including both the old data and the new data, perhaps weighting the new data more heavily (weighted loss functions can do this); see the sketch below. Keeping the old data will likely minimize the amount of new data that has to be created.
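
As an illustration of the "weight the new data more heavily" idea, here is a rough sketch assuming scikit-learn and hypothetical arrays X_old/y_old (the old training set) and X_new/y_new (the newly labelled pages):

    # Retrain from scratch on old + new labelled data, giving the recent
    # examples more weight. X_old/y_old and X_new/y_new are hypothetical names.
    import numpy as np
    from sklearn.svm import LinearSVC

    X = np.vstack([X_old, X_new])   # use scipy.sparse.vstack for sparse features
    y = np.concatenate([y_old, y_new])

    # e.g. recent examples count three times as much as old ones (tune this).
    weights = np.concatenate([np.ones(len(y_old)), 3.0 * np.ones(len(y_new))])

    clf = LinearSVC()
    clf.fit(X, y, sample_weight=weights)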

If you want to continually update your model in light of new data (i.e. if this will be a recurring problem), consider switching to a classifier that supports online learning out of the box. The obvious choice would be one of the passive-aggressive family of training methods; MIRA is pretty good (it is essentially an online SVM).
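
For the online-learning route, here is a small sketch using scikit-learn's PassiveAggressiveClassifier (a passive-aggressive learner in the same spirit as MIRA); the batch names are hypothetical:

    # Online updates with a passive-aggressive classifier: fit once on the
    # initial data, then keep calling partial_fit as new labelled pages arrive.
    import numpy as np
    from sklearn.linear_model import PassiveAggressiveClassifier

    classes = np.array([0, 1])  # all class labels must be declared up front
    clf = PassiveAggressiveClassifier()
    clf.partial_fit(X_initial, y_initial, classes=classes)

    # Later, for each new batch of labelled pages (X_batch, y_batch):
    clf.partial_fit(X_batch, y_batch)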

+2

Source: https://habr.com/ru/post/1202674/
