Reduce the Google Word2Vec Model with Gensim

Question

Loading the full pre-trained word2vec model from Google is time-consuming and tedious, so I was wondering whether there is a way to remove words below a certain frequency and bring the vocab count down to, for example, 200 thousand words.

I found methods in the gensim package's Word2Vec code to determine word frequencies and to save the model again, but I'm not sure how to pop/remove vocab from a pre-trained model before saving it again. I could not find any hint in the KeyedVectors class or the Word2Vec class for such an operation.

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py

How can I select a subset of the vocabulary of a pre-trained word2vec model?

+4

nlp gensim word2vec

neurix Feb 25 '17 at 17:38

2 answers

/ - , Google?:)

https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models

, , Google , : https://groups.google.com/forum/#!topic/gensim/wkVhcuyj0Sg

, , .

https://github.com/RaRe-Technologies/gensim/pull/987

+3

Luke Barker 25 . '17 23:33

gojomo · Accepted Answer · 2017-02-27T09:44:02+0000

The GoogleNews word-vector file format does not include frequency information. But it does seem to be sorted roughly from most frequent to least frequent.

And load_word2vec_format() offers an optional limit parameter that reads only that many vectors from the given file.

So, the following should do roughly what you asked:

from gensim.models import KeyedVectors
googlenews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)
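If the goal is a smaller file on disk, so that later loads are fast without passing limit each time, the same "keep the first N vectors" idea can be applied to the file itself. Below is a minimal stdlib-only sketch, not gensim's implementation. It assumes the usual word2vec binary layout: a text header line "vocab_size dim", then for each word its bytes, a space, and dim 4-byte floats, optionally followed by a newline. The function name truncate_word2vec_binary and the output file name are hypothetical.

```python
import gzip


def truncate_word2vec_binary(src_path, dst_path, limit):
    """Copy the first `limit` vectors of a word2vec binary file to dst_path.

    Assumes the standard word2vec binary layout (see the lead-in); this is
    an illustrative sketch, not part of gensim. Returns the number of
    vectors written.
    """
    # The GoogleNews download is gzip-compressed; plain .bin files also work.
    opener = gzip.open if src_path.endswith(".gz") else open
    with opener(src_path, "rb") as src, open(dst_path, "wb") as dst:
        vocab_size, dim = (int(x) for x in src.readline().split())
        keep = min(limit, vocab_size)
        # Write a new header that reflects the reduced vocabulary size.
        dst.write("{} {}\n".format(keep, dim).encode("utf-8"))
        for _ in range(keep):
            word = bytearray()
            while True:
                ch = src.read(1)
                if ch == b"":
                    raise EOFError("file ended before the expected vector count")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline left over from the previous row
                    word.extend(ch)
            vector = src.read(4 * dim)  # dim float32 values, copied verbatim
            if len(vector) != 4 * dim:
                raise EOFError("truncated vector data")
            dst.write(bytes(word) + b" " + vector)
    return keep


# Example call (hypothetical output name):
# truncate_word2vec_binary("GoogleNews-vectors-negative300.bin.gz",
#                          "GoogleNews-top200k.bin", 200000)
```

The resulting file should then load with load_word2vec_format(..., binary=True) without a limit argument. Like limit itself, this relies on the file being ordered roughly by frequency; it does not filter by actual word counts, which the file does not contain.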
