Reduce the Google Word2Vec Model with Gensim

Loading the full pretrained word2vec model from Google is slow and tedious, so I was wondering whether I could drop words below a certain frequency to bring the vocab count down to, for example, 200 thousand words.

I found methods in the gensim package's Word2Vec code for determining word frequencies and saving the model again, but I'm not sure how to pop/remove vocab entries from a pretrained model before saving it. I could not find any hint of such an operation in either the KeyedVectors class or the Word2Vec class.

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py

How can I select a subset of the vocabulary of a pretrained word2vec model?

2 answers

The GoogleNews word-vector file format does not include frequency information. However, the file does appear to be sorted roughly from most frequent to least frequent.

Also, load_word2vec_format() offers an optional limit parameter that reads only that many vectors from the given file.

So the following should do roughly what you asked for:

from gensim.models import KeyedVectors
goognews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)
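
To make the effect of limit concrete, here is a minimal sketch of the same idea using only the standard library. It is not gensim's implementation. It assumes the plain-text variant of the word2vec format (a header line with the entry count and dimensionality, then one "word v1 v2 ..." line per entry), whereas the GoogleNews file is binary, so treat it as an illustration rather than a drop-in tool. The function name truncate_word2vec_text and the sample data are made up for this example.

```python
import io


def truncate_word2vec_text(src, dst, limit):
    """Copy at most `limit` entries from a text-format word2vec stream to `dst`.

    Entries are kept in file order, so if the file is sorted from most to
    least frequent, the most frequent words are the ones that survive.
    """
    header = src.readline().split()
    total, dim = int(header[0]), int(header[1])
    kept = min(limit, total)
    # Write a new header so the count matches what is actually stored.
    dst.write(f"{kept} {dim}\n")
    for _ in range(kept):
        line = src.readline()
        if not line:
            raise ValueError("file ended before the declared entry count")
        dst.write(line)
    return kept


# Made-up sample: 4 words, 3 dimensions, most frequent first.
sample = io.StringIO(
    "4 3\n"
    "the 0.1 0.2 0.3\n"
    "of 0.4 0.5 0.6\n"
    "gensim 0.7 0.8 0.9\n"
    "rareword 1.0 1.1 1.2\n"
)
reduced = io.StringIO()
kept = truncate_word2vec_text(sample, reduced, limit=2)
print(reduced.getvalue())
```

With gensim itself, the KeyedVectors object loaded with limit can be written back to disk with its save_word2vec_format() method, which gives a smaller file that loads faster next time.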

Source: https://habr.com/ru/post/1670833/

