Combining pre-trained Word2Vec models?

I have the pre-trained Google News word vectors (trained on the ~100-billion-word Google News corpus). In addition, I trained my own model on 3 GB of data, producing another pre-trained vector file. Both have 300 dimensions and are more than 1 GB in size.

How can I combine these two huge sets of pre-trained vectors? Or how can I train a new model and update its vectors on top of another? I see that the C-based word2vec does not support incremental training.

I want to compute word analogies using these two models. I believe that vectors drawn from both sources will give pretty good results.
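
For example, this is the kind of analogy query I have in mind (a sketch using the gensim library — an assumption, since only the C tool is mentioned above — with placeholder file names):

```python
from gensim.models import KeyedVectors

# Load each pre-trained vector file (paths are placeholders).
news = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
mine = KeyedVectors.load_word2vec_format(
    "my-3gb-corpus-vectors.bin", binary=True)

# Classic analogy: king - man + woman ~= queen, asked of each model.
for name, kv in [("GoogleNews", news), ("own corpus", mine)]:
    print(name, kv.most_similar(positive=["king", "woman"],
                                negative=["man"], topn=1))
```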

+6
2 answers

There is no easy way to combine the end results of individual training sessions.

Even for the exact same data, slight randomization from the initial seeding or from thread-scheduling jitter will lead to different final states, making vectors fully comparable only within the same session.

That's because each session finds a useful configuration of vectors, but there are many equally useful configurations rather than a single best one.

For example, whatever final state you reach has many rotations/reflections that can be exactly as good at the training-prediction task, or perform exactly the same on some other task (such as solving analogies). But most of these possible alternatives will not have coordinates that can be mixed and matched for useful comparisons with each other.

Preloading your model with vectors from previous training runs might improve results after further training on new data, but I'm not aware of any rigorous testing of this possibility. The effect likely depends on your specific goals, your parameter choices, and how similar the new and old data are — or how representative they are of the eventual data the vectors will be used against.
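
One concrete way to experiment with such preloading (a sketch, not an endorsed recipe: it assumes gensim, whose `intersect_word2vec_format` seeds a freshly built model with vectors from a prior file for words the vocabularies share; names follow the gensim 3.x API, and the corpus is a stand-in):

```python
from gensim.models import Word2Vec

# Stand-in corpus; in practice, stream the real 3 GB of tokenized sentences.
sentences = [["breaking", "news", "report"], ["market", "news", "update"]] * 50

model = Word2Vec(size=300, min_count=1)   # gensim 3.x; vector_size= in 4.x
model.build_vocab(sentences)

# Overwrite vectors of words shared with the Google News file.
# lockf=1.0 lets preloaded vectors keep training; 0.0 would freeze them.
model.intersect_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)

model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
```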

For example, if the Google News corpus is unlike your own training data, or unlike the text you will be using the word vectors to understand, using it as a starting point might just slow down or bias your training. On the other hand, the longer you train on your new data, the more any influence of the preloaded values may be diluted to nothing. (If you really want a "blended" result, you might have to simultaneously train on the new data with an interleaved goal of nudging the vectors back toward the values from the prior data set.)

Ways to merge the results of independent sessions could make a good research project. Perhaps the method used in word2vec language-translation projects — learning a projection between vocabulary spaces — could also "translate" between the different coordinates of different runs. Perhaps locking some vectors in place, or training on the dual goals of "predict the new text" and "stay close to the old vectors", would yield meaningfully improved combined results.
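
To illustrate the projection idea, here is a minimal sketch of one standard way to learn such a mapping — orthogonal Procrustes alignment over the shared vocabulary (plain numpy; everything below is illustrative, not a method from this answer):

```python
import numpy as np

def learn_projection(src, tgt):
    """Orthogonal Procrustes: find a rotation W so that vectors from the
    src space, multiplied by W, land near their counterparts in tgt space.
    src and tgt are plain {word: 1-D numpy array} mappings."""
    shared = sorted(set(src) & set(tgt))
    A = np.vstack([src[w] for w in shared])
    B = np.vstack([tgt[w] for w in shared])
    u, _, vt = np.linalg.svd(A.T @ B)   # SVD of the cross-covariance
    return u @ vt                        # W, with A @ W ~= B

# Toy check: a randomly rotated copy of a space projects back onto it.
rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman"]
tgt = {w: rng.standard_normal(4) for w in words}
rot, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal map
src = {w: tgt[w] @ rot.T for w in words}
print(np.allclose(src["king"] @ learn_projection(src, tgt), tgt["king"]))
```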

+11

Here are my methods:

  • Download the Google News documents and merge them into your data, then train on the combined corpus!

  • Split your data set into two data sets of equal size, then train on both. You now have 3 models, so you can use a blending method for prediction (a sketch follows this list).
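
One simple reading of "blending" (an assumption — the answer does not spell out the scheme) is to query every model whose vocabulary covers the words and average the scores:

```python
def blended_similarity(models, w1, w2):
    """Uniform blend: average cosine similarity across all models
    (gensim KeyedVectors) whose vocabulary contains both words."""
    scores = [m.similarity(w1, w2) for m in models if w1 in m and w2 in m]
    return sum(scores) / len(scores) if scores else None

# e.g. blended_similarity([news_model, half_a, half_b], "king", "queen")
```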

I hope this can help you!

+3

Source: https://habr.com/ru/post/988012/
