"utf-8" error when loading a word2vec model

Question

I need to use a word2vec model that contains a large number of Chinese characters. The model was prepared by my colleagues using Java and saved as a .bin file.

I installed gensim and tried to load the model, but an error occurred:

In [1]: import gensim  

In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

I tried loading the model in both Python 2.7 and 3.5, and it failed in both. How can I load this model in gensim? Thanks.

+4

python nlp gensim word2vec

zfz Dec 23 '15 at 2:24

2 answers

[Answer text mostly lost in extraction. The surviving fragments mention unicode_errors='ignore', a unicode error, the file names filename.bin.gz and filename.gz, and Mac (Sierra) with Python 2.7.]

+1

theteddyboy 27 . '17 9:55

zfz · Accepted Answer · 2015-12-24T04:57:21+0000

The model contained many Chinese characters and was trained in Java. I could not determine the encoding of the original corpus. The error can be resolved as described in the gensim FAQ,

by calling load_word2vec_format with a flag that ignores character decoding errors:

In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True, unicode_errors='ignore')
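To illustrate what the flag changes, here is a minimal sketch that uses only Python's built-in decoder, on the assumption that gensim hands the unicode_errors value through to it. The sample word and the truncation are made up for the demonstration and are not taken from the model file. A multi-byte UTF-8 character that is cut short produces the same "unexpected end of data" message as in the question, and the 'ignore' and 'replace' handlers recover the intact part of the string.

```python
# Hypothetical sample: a two-character Chinese word, 3 bytes per character in UTF-8.
word = "中文"
raw = word.encode("utf-8")

# Simulate a damaged vocabulary entry by dropping the final byte,
# which leaves the second character incomplete.
truncated = raw[:-1]

# Strict decoding (the default) fails on the incomplete sequence.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    message = str(exc)
    print(message)

# 'ignore' silently drops the bytes that cannot be decoded.
ignored = truncated.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = truncated.decode("utf-8", errors="replace")

print(repr(ignored), repr(replaced))
```

Note that with 'ignore' the affected vocabulary entries are loaded under shortened keys, so a lookup by the original word may miss them. In gensim 1.0 and later, the same loader is available as gensim.models.KeyedVectors.load_word2vec_format, which accepts the same unicode_errors argument.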
